Jackil
Joined: 24 May 2006 Posts: 97
|
Yeah, I got a lot of errors with filenames and some Unicode coercion stuff
| |
|
|
|
|
BigDaddy
Joined: 26 May 2006 Posts: 147
|
1. Make sure the XML file is really UTF-8 by trying to decode it into Unicode with .decode('utf-8'). If this works, there are at least no UTF-8 errors.
2. Then either convert the Unicode to ASCII with .encode('ascii', 'xmlcharrefreplace'), which replaces the funny chars with numeric character references like &#NNNN;
Or 3. run the Unicode through the XML parser and hope it works
Or 4. convert it to a national 8-bit encoding with .encode('latin-1') or some such, run it through, then convert back to UTF-8 later (maybe with .decode('latin-1').encode('utf-8'))
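Here is a rough sketch of options 1, 2 and 4. It assumes Python 3, where the methods are `bytes.decode` and `str.encode`; the sample bytes and the variable names are made up for illustration.

```python
# Hypothetical sample: UTF-8 bytes as they might be read from the XML file.
raw = "<name>Bjørn</name>".encode("utf-8")

# 1. Check that the bytes really are UTF-8. A failure raises UnicodeDecodeError.
try:
    text = raw.decode("utf-8")
except UnicodeDecodeError as err:
    text = None
    print("not valid UTF-8:", err)

# 2. ASCII-only version: non-ASCII characters become numeric
#    character references such as &#248;
ascii_xml = text.encode("ascii", "xmlcharrefreplace")

# 4. Detour through Latin-1 and back to UTF-8. The encode step raises
#    UnicodeEncodeError if the text has characters outside Latin-1.
latin1 = text.encode("latin-1")
back_to_utf8 = latin1.decode("latin-1").encode("utf-8")
```

Option 2 is the safest of these for XML, since the character references survive any parser that only handles ASCII.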
| |
|
|
Jackil
Joined: 24 May 2006 Posts: 97
|
Thank you :) I will surely investigate this.
| |
|
|
BigDaddy
Joined: 26 May 2006 Posts: 147
| |
Jackil
Joined: 24 May 2006 Posts: 97
|
Yeah, been looking at those URLs lately. Just one more question: do you know of a reliable way to identify what encoding a file has?
| |
|
|
BigDaddy
Joined: 26 May 2006 Posts: 147
|
That is a hard problem. If it has 8-bit data in it (bytes of 0x80 or above) and the UTF-8 decoder works, it's probably UTF-8.
If there is no 8-bit data, it is most likely ASCII.
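That rule of thumb could look something like this. It assumes Python 3, and `guess_ascii_or_utf8` is a made-up helper name, not a library function. It only guesses; it cannot prove anything.

```python
def guess_ascii_or_utf8(data):
    """Return 'ascii', 'utf-8', or None if neither fits."""
    # No byte has the high bit set: plain ASCII is the most likely answer.
    if all(b < 0x80 for b in data):
        return "ascii"
    # There is 8-bit data: see whether the UTF-8 decoder accepts it.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "utf-8"


print(guess_ascii_or_utf8(b"hello"))
print(guess_ascii_or_utf8("blåbær".encode("utf-8")))
print(guess_ascii_or_utf8("blåbær".encode("latin-1")))
```

Random 8-bit text very rarely happens to be valid UTF-8, which is why a successful decode is a fairly strong hint.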
| |
|
|
Jackil
Joined: 24 May 2006 Posts: 97
|
Oki, I see
| |
|
|
BigDaddy
Joined: 26 May 2006 Posts: 147
|
If it has a Unicode byte order mark (the bytes FF FE or FE FF) at the beginning, it is probably UTF-16.
If you are in Norway and it's not UTF-8 and not ASCII, it might be ISO-8859-15.
Or who knows, it might be Chinese in some two-byte encoding.
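Putting the byte order mark check and the regional fallback together might look like the sketch below. It assumes Python 3; `guess_encoding` and its `fallback` parameter are made-up names, and the ISO-8859-15 default is only the "you are in Norway" guess from above. The UTF-32 marks are checked first because the little-endian one starts with the same two bytes as the UTF-16 one.

```python
import codecs

# Longest marks first, so UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data, fallback="iso-8859-15"):
    """Guess an encoding name for a bytes object; never a certainty."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    try:
        # Plain ASCII also passes here, since it is a subset of UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Cannot be verified from the bytes alone: purely a regional guess.
        return fallback
    return "utf-8"


print(guess_encoding("hei".encode("utf-16")))
print(guess_encoding("Bjørn".encode("iso-8859-15")))
```

The fallback is the weak spot: a single-byte encoding accepts nearly any byte sequence, so nothing in the data can confirm it.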
| |
|
|
Jackil
Joined: 24 May 2006 Posts: 97
|
Right... I just opened a can of worms ;)
| |
|
|
|