
Annotation of /branches/WIP-pyshapelib-bramz/Doc/technotes/string_representation.txt



Revision 2642
Fri Jul 1 20:49:04 2005 UTC by bh
Original Path: trunk/thuban/Doc/technotes/string_representation.txt
File MIME type: text/plain
File size: 3703 byte(s)
First step towards Unicode.  With this we're roughly at step 1 of
string_representation.txt

* Doc/technotes/string_representation.txt: New.  Document how
strings are represented in Thuban and how to get to a Unicode
Thuban.

* Thuban/__init__.py (set_internal_encoding)
(unicode_from_internal, internal_from_unicode): New. The first few
functions for the internal string representation

* Thuban/UI/about.py (unicodeToLocale): Removed.  Use
internal_from_unicode instead.

* Thuban/UI/__init__.py (install_wx_translation): Determine the
encoding to use for the internal string representation.  Also,
change the translation function to return strings in internal
representation even on unicode builds of wxPython

* Thuban/Model/load.py (SessionLoader.check_attrs): Decode
filenames too.
(SessionLoader.start_clrange): Use check_attrs to decode and check
the attributes.

* Thuban/Model/xmlreader.py (XMLReader.encode): Use
internal_from_unicode to convert unicode strings.

* Thuban/Model/xmlwriter.py (XMLWriter.encode): Use
unicode_from_internal when applicable

* test/runtests.py (main): New command line option:
internal-encoding to specify the internal string encoding to use
in the tests.

* test/support.py (initthuban): Set the internal encoding to
latin-1

* test/test_load.py (TestSingleLayer.test, TestClassification.test)
(TestLabelLayer.test): Use the internal string representation when
dealing with non-ascii characters

* test/test_load_1_0.py (TestSingleLayer.test)
(TestClassification.test, TestLabelLayer.test): Use the internal
string representation when dealing with non-ascii characters

* test/test_load_0_9.py (TestSingleLayer.test)
(TestClassification.test): Use the internal string representation
when dealing with non-ascii characters

* test/test_load_0_8.py (TestUnicodeStrings.test): Use the
internal string representation when dealing with non-ascii
characters

* test/test_save.py (XMLWriterTest.testEncode)
(SaveSessionTest.testClassifiedLayer): Use the internal string
representation when dealing with non-ascii characters where
applicable

Title: String Representation in Thuban
Author: Bernhard Herzog <[email protected]>
Last-Modified: $Date$
Version: $Revision$


Introduction

Thuban originally assumed that text is represented by byte strings
encoded in ISO-8859-1 (latin-1). This is a problem when the default
encoding in the user's locale is not latin-1 but, for example, UTF-8.
The solution is to use a more flexible representation that also
allows a later switch to Unicode as the internal string
representation.


Internal String Representation

Thuban has an internal string representation. All textual data read
by Thuban has to be converted to the internal representation. All
data written by Thuban has to be converted into whatever form the
output device expects.

Thuban provides functions to convert between the internal
representation and other representations. For example,
internal_from_unicode converts from Unicode and should be used when
reading XML files, and unicode_from_internal converts to Unicode.

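The conversion functions described above might look roughly like the
following. This is a minimal sketch, not Thuban's actual code: the
function names are the ones mentioned in this note, but the bodies,
the module-level variable and the "replace" error handling are
assumptions made for illustration. It is written for Python 3, where
bytes plays the role of the byte string and str that of the Unicode
object.

```python
# Illustrative sketch only; the implementation details are assumed.
_internal_encoding = "latin-1"  # hypothetical module-level setting


def set_internal_encoding(encoding):
    """Set the encoding used for the internal byte-string representation."""
    global _internal_encoding
    _internal_encoding = encoding


def internal_from_unicode(text):
    """Convert a Unicode string to the internal representation (bytes)."""
    # "replace" is an assumption: unencodable characters become "?"
    # instead of raising an exception.
    return text.encode(_internal_encoding, "replace")


def unicode_from_internal(data):
    """Convert the internal representation (bytes) to a Unicode string."""
    return data.decode(_internal_encoding, "replace")


# Round trip: text read from an XML file goes in, Unicode comes back out.
set_internal_encoding("utf-8")
internal = internal_from_unicode("Stra\u00dfe")
assert unicode_from_internal(internal) == "Stra\u00dfe"
```

Keeping the encoding in one place means callers never name an
encoding themselves, which is what would make a later switch of the
internal representation feasible.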
The ultimate goal is to use Unicode objects as the internal string
representation. Getting there will take a lot of work because we
will have to find all the places where conversions are needed.
Therefore, for now, the internal representation will be byte strings
in the user's default encoding.

With byte strings, and especially with encodings like latin-1, we
can get by without doing all the conversions correctly, because
every byte string is a valid latin-1 string even if it was actually
written in a different encoding. In those cases the text may look
strange, but in most cases no exceptions are raised. With Unicode
objects, exceptions are much more likely. In the end it's better to
see some incorrect characters than no data at all.

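The difference between "strange text" and "exception" can be seen in
a few lines. The sample string and variable names below are made up
for illustration, and the snippet uses Python 3 (bytes and str)
rather than the byte strings and Unicode objects of the Python
version Thuban was written for.

```python
# Illustrative example; the sample data is made up.
sample = "Gr\u00fc\u00dfe"

# latin-1 assigns a character to every byte value, so decoding any
# byte string as latin-1 succeeds.
all_bytes = bytes(range(256))
assert len(all_bytes.decode("latin-1")) == 256

# latin-1 bytes decoded as latin-1: the correct text.
latin1_bytes = sample.encode("latin-1")
as_latin1 = latin1_bytes.decode("latin-1")

# UTF-8 bytes decoded as latin-1: wrong characters, but no exception.
mojibake = sample.encode("utf-8").decode("latin-1")

# latin-1 bytes decoded strictly as UTF-8: an exception, and no data.
try:
    latin1_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```

Here mojibake is longer than the original text and looks wrong, but
it can still be displayed, while the strict UTF-8 decode yields
nothing at all.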
All this boils down to the following steps:

1. Byte-Strings as Internal Representation

The internal representation consists of byte strings in the user's
default encoding as determined by the locale. The encoding is chosen
so that such byte strings can be passed to wxPython without problems.
This even works with Unicode builds if we take care to convert the
translated strings (wxGetTranslation returns Unicode objects in a
Unicode build).

If no suitable encoding can be determined, use latin-1. It might be
better to use ASCII instead, but latin-1 offers somewhat better
backwards compatibility with older Thuban versions.

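One way to pick the encoding with the latin-1 fallback described
above is sketched below. The helper names normalize_encoding and
guess_internal_encoding are hypothetical and do not come from
Thuban; only the standard library calls locale.getpreferredencoding
and codecs.lookup are real. How Thuban itself queries the locale may
well differ.

```python
import codecs
import locale


def normalize_encoding(name, fallback="latin-1"):
    """Return a usable codec name, or the fallback if name is unusable.

    Hypothetical helper for illustration.
    """
    if not name:
        return fallback
    try:
        # codecs.lookup raises LookupError for unknown encodings.
        return codecs.lookup(name).name
    except LookupError:
        return fallback


def guess_internal_encoding(fallback="latin-1"):
    """Guess the internal encoding from the user's locale settings."""
    try:
        # False: do not change the process locale while asking.
        name = locale.getpreferredencoding(False)
    except locale.Error:
        name = None
    return normalize_encoding(name, fallback)


internal_encoding = guess_internal_encoding()
```

Checking the name against the codec registry first means a locale
that reports an encoding Python does not know ends up at latin-1
instead of failing later, during the first conversion.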
Start implementing the conversion functions and use them wherever
we have hard-coded conversions to latin-1. It's not necessary to
find all places where conversion has to be done at this point.
Since we're using byte strings in the user's default encoding, most
byte strings read by Thuban are already in the right form, and in
most cases that is also the right form for output.


2. Implement the conversion wherever necessary

Start working toward Unicode as the internal representation. In
this phase, we need to find all places where conversion has to be
done. To help with this, there will be a command line option that
sets the internal representation to Unicode so that it's easy to
test.

The most difficult areas for this are probably the various data
sources. Some of them (dbf files, for instance) don't provide any
information about the encodings used.

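For a data source that does not declare its encoding, the conversion
has to rely on a guess. The sketch below shows one possible policy;
it is not something Thuban is known to implement. The function name
internal_from_source and its parameters are hypothetical. It tries
an encoding supplied by the caller and otherwise falls back to
latin-1, on the principle that wrong characters are better than no
data.

```python
def internal_from_source(raw, source_encoding=None,
                         internal_encoding="utf-8"):
    """Convert bytes read from a data source to the internal byte string.

    Hypothetical helper: source_encoding is whatever the caller
    believes the data is encoded in, or None if nothing is known.
    """
    try:
        text = raw.decode(source_encoding or internal_encoding)
    except (UnicodeDecodeError, LookupError):
        # Unknown codec or undecodable bytes: latin-1 always decodes,
        # so some text is shown even if a few characters are wrong.
        text = raw.decode("latin-1")
    return text.encode(internal_encoding, "replace")


# A latin-1 field value from a file with no encoding information.
value = internal_from_source(b"Gr\xfc\xdfe")
```

With the default arguments the value is first tried as UTF-8, which
fails for these bytes, and is then read as latin-1.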

3. Switch to Unicode

Finally, switch to Unicode as the internal string representation.
For this step it might be best to wait until Unicode builds of
wxPython are the default on the common platforms.


Properties

Name Value
svn:eol-style native
svn:keywords Author Date Id Revision
