| Author |
Message |
Jackil
Joined: 24 May 2006 Posts: 97
|
| Hi guys. I'm having problems parsing an XML file while running a script on both Linux and Windows. Should I save the file as UTF-8 on both platforms? |
| |
|
|
|
|
BigDaddy
Joined: 26 May 2006 Posts: 147
|
Jackil, what is the encoding of the XML file? What is giving you the option of saving it in another encoding?
And yes, UTF-8 should be OK as long as you read it back in as UTF-8 and the XML parser accepts that. |
| |
|
|
Jackil
Joined: 24 May 2006 Posts: 97
|
| BigDaddy, hmm.. the encoding is specified as UTF-8, but the file is fetched from a web server. |
| |
|
|
BigDaddy
Joined: 26 May 2006 Posts: 147
|
| How are you fetching it? |
| |
|
|
Jackil
Joined: 24 May 2006 Posts: 97
|
| I use urllib.urlopen |
| |
|
|
BigDaddy
Joined: 26 May 2006 Posts: 147
|
| The result is a string, right? |
| |
|
|
Jackil
Joined: 24 May 2006 Posts: 97
|
Well, a file handle,
which I have written a small XML parser for. |
| |
|
|
BigDaddy
Joined: 26 May 2006 Posts: 147
|
| When reading it in, you might want to convert it to Unicode with stringjustread.decode('utf-8'), where stringjustread is the byte string you just read. |
| |
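A minimal sketch of that decode step, written in Python 3 terms, where the bytes read from a response and the decoded text are distinct types (the `urllib.urlopen` in this thread is the Python 2 spelling). The name `raw` and the sample bytes are made up for illustration; in a real script they would come from the response's `read()`.

```python
# Hypothetical stand-in for the bytes returned by response.read();
# b"\xc3\xa6" is the UTF-8 encoding of the single character "æ".
raw = b'<file name="bl\xc3\xa6kk.txt"/>'

# Decode explicitly instead of relying on the platform default encoding.
text = raw.decode("utf-8")

print(type(raw).__name__, len(raw))    # byte string: one element per byte
print(type(text).__name__, len(text))  # text string: one element per character
```

The two lengths differ by one because "æ" takes two bytes in UTF-8 but is a single character once decoded.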
|
|
Jackil
Joined: 24 May 2006 Posts: 97
|
| Hmm.. interesting. |
| |
|
|
BigDaddy
Joined: 26 May 2006 Posts: 147
|
| To answer the original question: whatever worked on Windows should work on Linux, except that the system default encoding for 8-bit strings on Windows might be "windows-1252" rather than "ascii". |
| |
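One way to sidestep platform defaults is to inspect them and then never depend on them. The sketch below is Python 3 and only an illustration: the helper names `save_text` and `load_text`, the temporary path, and the sample filename are all assumptions, not anything from the script under discussion.

```python
import locale
import os
import sys
import tempfile

# What Python assumes when no encoding is given; both values vary by platform.
print("default string encoding:", sys.getdefaultencoding())
print("preferred locale encoding:", locale.getpreferredencoding(False))

def save_text(path, text):
    # Naming the encoding makes the bytes on disk identical on every platform.
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)

def load_text(path):
    # Read back with the same explicit encoding that was used for writing.
    with open(path, "r", encoding="utf-8") as fh:
        return fh.read()

# Round trip with a hypothetical filename containing non-ASCII characters.
path = os.path.join(tempfile.mkdtemp(), "names.txt")
save_text(path, "smørbrød.txt")
restored = load_text(path)
```

As long as both the writer and the reader name the same encoding, the result should not depend on whether the script runs on Windows or Linux.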
|
|
Jackil
Joined: 24 May 2006 Posts: 97
|
The system default encoding on Windows here was ascii.
I actually ran into problems when parsing the XML file, because it also contains a string to save another file. |
| |
|
|
|
|
BigDaddy
Joined: 26 May 2006 Posts: 147
|
Huh? What do you mean by "a string to save another file"?
Embedded Python code? An XML reference to another file? |
| |
|
|
Jackil
Joined: 24 May 2006 Posts: 97
|
Sorry, one of the fields contains a filename,
which might contain characters not found in ASCII. |
| |
|
|
BigDaddy
Joined: 26 May 2006 Posts: 147
|
| How do you want to handle those? Will your small homebuilt XML parser accept Unicode strings? |
| |
|
|
Jackil
Joined: 24 May 2006 Posts: 97
|
| I have no idea. I'm using xml.sax.saxutils to create an XML handler. But I'm really unsure whether I should enforce UTF-8 to avoid problems when both storing and reading the file. |
| |
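For reference, here is a small sketch of how a handler built on the standard `xml.sax` package sees non-ASCII text. It is written for Python 3, where the parser hands the handler already-decoded text; the element name `filename`, the class `FilenameHandler`, and the sample document are invented for illustration and are not the actual feed from this thread.

```python
import xml.sax

class FilenameHandler(xml.sax.ContentHandler):
    """Collects the text of every <filename> element (a made-up element name)."""

    def __init__(self):
        super().__init__()
        self.filenames = []
        self._buffer = None

    def startElement(self, name, attrs):
        if name == "filename":
            self._buffer = []

    def characters(self, content):
        # characters() may be called several times per element, so accumulate.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "filename":
            self.filenames.append("".join(self._buffer))
            self._buffer = None

# Hypothetical document; the bytes are UTF-8, matching the declaration.
document = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<files><filename>blåbær.txt</filename></files>"
).encode("utf-8")

handler = FilenameHandler()
xml.sax.parseString(document, handler)
print(handler.filenames)
```

The parser is given the raw bytes and reads the encoding from the XML declaration itself, so the handler never has to decode anything by hand.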
|
|
BigDaddy
Joined: 26 May 2006 Posts: 147
|
| The transition to Unicode is a pain in the ass. Luckily I'm an American and can just ignore it. |
| |
|
|
Jackil
Joined: 24 May 2006 Posts: 97
|
BigDaddy, I'm really just looking for a way to store filenames with local characters on both Linux and Windows, and then be able to get the names back in a readable form.
But this whole character thing really confuses me... |
| |
|
|
BigDaddy
Joined: 26 May 2006 Posts: 147
|
Jackil, if the XML file is coming back as UTF-8, you need to be aware of that and read it in as UTF-8 (by reading the string and then getting a new Unicode string from it with .decode('utf-8')).
Or verify that the XML parser understands UTF-8, or teach it to (I doubt it does). |
| |
|
|
Jackil
Joined: 24 May 2006 Posts: 97
|
| BigDaddy, OK, the filesystem on Linux is using UTF-8. But would it be easier to just let the web server enforce ISO? |
| |
|
|
BigDaddy
Joined: 26 May 2006 Posts: 147
|
The other issue is whether the other end is really putting UTF-8 into the XML file, or just the local filename in whatever encoding the filesystem (or whatever generated the XML) is using.
Because if the server is SAYING it's UTF-8 but it's actually Latin-1 in the fields, you are in trouble ;) |
| |
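A rough way to catch the "says UTF-8 but is really Latin-1" case is to attempt a strict UTF-8 decode and fall back only when it fails. This is a heuristic sketch in Python 3, not a guaranteed detector: the function name `decode_field` and the sample values are assumptions for illustration.

```python
def decode_field(raw, declared="utf-8", fallback="latin-1"):
    """Return (text, encoding_used) for a byte string of uncertain encoding."""
    try:
        return raw.decode(declared), declared
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails,
        # but the result is only right if the data really was Latin-1.
        return raw.decode(fallback), fallback

good = "blåbær.txt".encode("utf-8")    # really UTF-8
bad = "blåbær.txt".encode("latin-1")   # Latin-1 bytes mislabeled as UTF-8

print(decode_field(good))
print(decode_field(bad))
```

The strict decode is useful because non-ASCII Latin-1 text is rarely valid UTF-8 by accident, so a failure is a strong hint that the declaration is wrong.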
|
|
Jackil
Joined: 24 May 2006 Posts: 97
|
| Yeah, I got a lot of errors with filenames and some Unicode coerce stuff. |
| |
|
|
|
|
BigDaddy
Joined: 26 May 2006 Posts: 147
|
1. Make sure the XML file is really UTF-8 by trying to decode it into Unicode with .decode('utf-8'). If this works, there are at least no UTF-8 errors.
2. Either convert the Unicode to ASCII with .encode('ascii', 'xmlcharrefreplace') to replace funny chars with &#NNNN; references,
3. or run the Unicode through the XML parser and hope it works,
4. or convert it to nationalized 8-bit text with .encode('latin-1') or somesuch, run it through, then convert back to UTF-8 later (maybe with .decode('latin-1').encode('utf-8')). |
| |
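The options above that involve explicit conversions (steps 1, 2 and 4) can be sketched as follows in Python 3. The field value is a made-up example; step 3 is left out because it depends on the parser in use.

```python
raw = "blåbær.txt".encode("utf-8")  # hypothetical field value as fetched

# Step 1: a strict decode raises UnicodeDecodeError on invalid UTF-8.
text = raw.decode("utf-8")

# Step 2: ASCII-only bytes; non-ASCII characters become numeric references.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")
print(ascii_bytes)

# Step 4: detour through an 8-bit encoding, then back to UTF-8.
latin1_bytes = text.encode("latin-1")
utf8_again = latin1_bytes.decode("latin-1").encode("utf-8")
```

Note that the Latin-1 detour in step 4 only works for characters that exist in Latin-1; anything outside it makes `.encode('latin-1')` raise `UnicodeEncodeError`, whereas step 2 is lossless for any character.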
|
|
Jackil
Joined: 24 May 2006 Posts: 97
|
| thank you :) I will surely investigate this. |
| |
|
|
BigDaddy
Joined: 26 May 2006 Posts: 147
| |
Jackil
Joined: 24 May 2006 Posts: 97
|
| Yeah, I've been looking at those URLs lately. Just one more question: do you know of a sure way to identify what encoding a file has? |
| |
|
|