Revision 2734, Thu Mar 1 12:42:59 2007 UTC, by bramz

Title: String Representation in Thuban
Author: Bernhard Herzog
Last-Modified: $Date$
Version: $Revision$


Introduction

Thuban originally assumed that text is represented by byte strings
encoded in ISO-8859-1 (latin-1). This is problematic when the
default encoding in the user's locale is not in fact latin-1 but,
for example, UTF-8. The solution is to use a more flexible
representation that will also allow a switch to Unicode as the
internal string representation at some point.


Internal String Representation

Thuban has an internal string representation. All textual data read
by Thuban has to be converted to the internal representation. All
data written by Thuban has to be converted into whatever form is
used by the output device.

Thuban provides functions to convert between the internal
representation and other representations. For example,
internal_from_unicode converts from Unicode and should be used when
reading XML files, for instance, and unicode_from_internal handles
the conversion to Unicode.
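
As a rough illustration, the two functions could look like the
sketch below. Only the two function names come from this note. The
module-level variable, the latin-1 value and the "replace" error
handling are assumptions made for the example, and the code is
written for Python 3, where byte strings are bytes objects and
Unicode objects are str.

```python
# Sketch only: the function names come from the note; everything else
# (the variable, its value, the error handling) is an assumption.
internal_encoding = "latin-1"  # assumed; really depends on the locale


def internal_from_unicode(text):
    """Convert a Unicode string to the internal byte-string form."""
    # "replace" substitutes "?" for characters the encoding cannot
    # express, so the conversion itself never raises.
    return text.encode(internal_encoding, "replace")


def unicode_from_internal(data):
    """Convert an internal byte string back to a Unicode string."""
    return data.decode(internal_encoding, "replace")


# Round trip for a string that latin-1 can represent ("Strasse"
# spelled with a sharp s, written as an escape to stay ASCII-only).
street = "Stra\u00dfe"
round_trip = unicode_from_internal(internal_from_unicode(street))
```

With this shape, code that reads XML would call
internal_from_unicode on every string it receives from the parser,
and code that hands text to a Unicode-only API would call
unicode_from_internal.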

The ultimate goal is to use Unicode objects as the internal string
representation. Getting there will take a lot of work because we
will have to find all the places where we need to make the
conversions. Therefore the internal representation will initially be
byte strings in the user's default encoding.

With byte strings, and especially with encodings like latin-1, we can
get by without doing all the conversions correctly because every
byte string is a valid latin-1 string, even if it actually uses a
different encoding. In those cases the text may look strange, but
in most cases there won't be exceptions. With Unicode objects,
exceptions are much more likely. And in the end it's better to see
some incorrect characters than no data at all.
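
The difference can be seen in a few lines of Python 3. This is a
small demonstration, not Thuban code, and the variable names are
made up for the example.

```python
# Every possible byte value, 0x00 through 0xFF.
data = bytes(range(256))

# latin-1 maps each byte to exactly one character, so decoding
# arbitrary bytes as latin-1 never fails.
text = data.decode("latin-1")

# The same bytes are not valid UTF-8, so a strict decode raises.
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# UTF-8 bytes misread as latin-1: the result is two wrong
# characters instead of one a-umlaut, but there is no exception.
misread = "\u00e4".encode("utf-8").decode("latin-1")
```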

All this boils down to the following steps:

1. Byte-Strings as Internal Representation

The internal representation is byte strings in the user's default
encoding as determined by the locale. The encoding is chosen so
that such byte strings can be passed to wxPython without problems.
This even works with Unicode builds if we take care to convert the
translated strings (wxGetTranslation returns Unicode objects in a
Unicode build).

If no suitable encoding can be determined, use latin-1. It might be
better to use ASCII instead, but latin-1 offers somewhat better
backwards compatibility with older Thuban versions.
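
One way to pick the encoding with the latin-1 fallback is sketched
below. The helper names pick_encoding and default_internal_encoding
are hypothetical, and what counts as a "suitable" encoding is
reduced here to "Python has a codec for it", which is an assumption
of this example rather than a rule from this note.

```python
import codecs
import locale


def pick_encoding(candidate, fallback="latin-1"):
    """Return the codec name for candidate, or fallback if unusable."""
    if not candidate:
        return fallback
    try:
        # codecs.lookup raises LookupError for unknown encodings.
        return codecs.lookup(candidate).name
    except LookupError:
        return fallback


def default_internal_encoding():
    """Encoding suggested by the user's locale, else latin-1."""
    # Passing False avoids changing the process-wide locale settings.
    return pick_encoding(locale.getpreferredencoding(False))
```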

Start implementing the conversion functions and use them wherever
we have hard-coded conversions to latin-1. It's not necessary to
find all places where conversion has to be done at this point.
Since we're using byte strings in the user's default encoding, most
byte strings that are read by Thuban are already in the right form,
and in most cases it's also the right form for output.


2. Implement the Conversion Wherever Necessary

Start working toward Unicode as the internal representation. In
this phase, we need to find all places where conversion has to be
done. To help with this, there will be a command line option that
sets the internal representation to Unicode so that it's easy to
test.
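
Such a switch might be wired up roughly as follows. The option name
--unicode-internal, the helper names and the use of argparse are all
assumptions for this sketch; this note does not say how the option
is spelled or parsed.

```python
import argparse


def parse_args(argv):
    """Parse the command line; the option name is hypothetical."""
    parser = argparse.ArgumentParser(prog="thuban")
    parser.add_argument(
        "--unicode-internal",
        action="store_true",
        help="use Unicode objects as the internal representation",
    )
    return parser.parse_args(argv)


def make_internal_from_unicode(use_unicode, encoding="latin-1"):
    """Return the converter from Unicode to the internal form."""
    if use_unicode:
        # Unicode is already the internal form: nothing to convert.
        return lambda text: text
    # Otherwise the internal form is a byte string in `encoding`.
    return lambda text: text.encode(encoding, "replace")


options = parse_args(["--unicode-internal"])
internal_from_unicode = make_internal_from_unicode(options.unicode_internal)
```

Running with the option makes missing conversions visible quickly,
because byte strings and Unicode objects then no longer mix
silently.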

The most difficult areas for this are probably the various data
sources. Some of them (dbf files, for instance) don't provide
any information about the encodings used.
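
For a source that does not declare its encoding, one possible
approach is to try a list of guesses and fall back to latin-1, which
accepts any byte sequence. The function name decode_field and the
default guess of UTF-8 are assumptions for this sketch. It is a
heuristic and not something this note prescribes.

```python
def decode_field(raw, guesses=("utf-8",)):
    """Decode raw bytes, trying each guess before latin-1."""
    for name in guesses:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # latin-1 never fails; the text may be wrong, but it is not lost.
    return raw.decode("latin-1")


# The same character stored two ways: as UTF-8 and as latin-1.
from_utf8 = decode_field(b"\xc3\xa4")
from_latin1 = decode_field(b"\xe4")
```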


3. Switch to Unicode

Finally, switch to Unicode as the internal string representation.
For this step it might be best to wait until Unicode builds of
wxPython are the default on the common platforms.
