Web Design Forums


troubles with unicode


Jackil



Joined: 24 May 2006
Posts: 97
Hi guys. I'm having some problems parsing an XML file when running a script on both Linux and Windows. Should I save the file as UTF-8 on both platforms?


BigDaddy



Joined: 26 May 2006
Posts: 147
Jackil, what is the encoding of the XML file? And what is giving you the option of saving it in another encoding?
And yes, UTF-8 should be OK as long as you read it back in as UTF-8 and the XML parser accepts that.
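
A minimal sketch of that round trip, written in Python 3 syntax (which is newer than this thread). The file name `example.xml` and its contents are made up for illustration; the only point is that the same encoding is named explicitly both when writing and when reading.

```python
import os
import tempfile

# Hypothetical XML content with a non-ASCII filename in one attribute.
xml_text = '<?xml version="1.0" encoding="utf-8"?>\n<file name="blåbær.txt"/>\n'

# Hypothetical path; a temp directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.xml")

# Write as UTF-8 explicitly instead of relying on the platform default.
# newline="\n" stops Windows from translating line endings.
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write(xml_text)

# Read back with the same encoding; newline="" leaves line endings untouched.
with open(path, "r", encoding="utf-8", newline="") as f:
    round_tripped = f.read()

print(round_tripped == xml_text)
```

If the encoding is left out, `open` falls back to a platform-dependent default, which is one way the same script can behave differently on Linux and Windows.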
Jackil



Joined: 24 May 2006
Posts: 97
BigDaddy, hmm... the encoding is specified as UTF-8, but the file is fetched from a web server.
BigDaddy



Joined: 26 May 2006
Posts: 147
How are you fetching it?
Jackil



Joined: 24 May 2006
Posts: 97
I use urllib.urlopen
BigDaddy



Joined: 26 May 2006
Posts: 147
The result is a string, right?
Jackil



Joined: 24 May 2006
Posts: 97
Well, a file handle, which I then pass to a small XML parser I've created.
BigDaddy



Joined: 26 May 2006
Posts: 147
When reading it in, you might want to get it into unicode with stringjustread.decode('utf-8')
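
Roughly like this, as a sketch in Python 3 syntax. There the handle returned by `urllib.request.urlopen` yields bytes, which plays the role of the 8-bit string discussed here. `read_as_text` is a hypothetical helper name, and an in-memory `io.BytesIO` stands in for the real handle so the example runs without a network.

```python
import io

def read_as_text(handle, encoding="utf-8"):
    """Read all bytes from a file-like object and decode them to text."""
    raw = handle.read()          # bytes, exactly as the server sent them
    return raw.decode(encoding)  # raises UnicodeDecodeError on invalid input

# Stand-in for the handle returned by urlopen (assumption: no network here).
fake_response = io.BytesIO("<name>blåbær.txt</name>".encode("utf-8"))

text = read_as_text(fake_response)
print(text)
```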
Jackil



Joined: 24 May 2006
Posts: 97
Hmm.. interesting.
BigDaddy



Joined: 26 May 2006
Posts: 147
To answer the original question: whatever worked on Windows should work on Linux, except that on Windows the system default encoding for 8-bit strings might be "windows-1252" and not "ascii".
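
A quick way to see what a given machine reports, as a sketch using only the standard library. `report_encodings` is a hypothetical helper name. Note that these are three different settings, and the values differ between platforms and Python versions.

```python
import locale
import sys

def report_encodings():
    """Collect the encodings Python reports on this machine."""
    return {
        # Python's own default string encoding
        # ('ascii' on Python 2, 'utf-8' on Python 3).
        "python_default": sys.getdefaultencoding(),
        # What the OS locale prefers for text data.
        "locale_preferred": locale.getpreferredencoding(False),
        # What Python uses to convert filenames to and from the OS.
        "filesystem": sys.getfilesystemencoding(),
    }

encodings = report_encodings()
for name, value in encodings.items():
    print(name, "=", value)
```

Running this on both machines shows where the two environments disagree before any parsing code is involved.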
Jackil



Joined: 24 May 2006
Posts: 97
The system default encoding on Windows here was ascii.
I actually got some problems when parsing the XML file, because it also contains a string used to save another file.


BigDaddy



Joined: 26 May 2006
Posts: 147
Huh? What do you mean by "a string to save another file"?
Embedded Python code? An XML reference to another file?
Jackil



Joined: 24 May 2006
Posts: 97
Sorry, one of the fields contains a filename, which might contain characters not found in ASCII.
BigDaddy



Joined: 26 May 2006
Posts: 147
How do you want to handle those? Will your small homebuilt XML parser accept unicode strings?
Jackil



Joined: 24 May 2006
Posts: 97
I have no idea, I'm using xml.sax.saxutils to create an XML handler. But I'm really confused about whether I should enforce UTF-8 to avoid problems when both storing and reading the file.
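
For what it's worth, a SAX handler can usually be fed the raw bytes directly: the parser behind `xml.sax` reads the encoding from the XML declaration and passes already-decoded text to the handler. Below is a minimal sketch in Python 3 syntax. The `<filename>` tag, the `FilenameHandler` class, and the sample document are all assumptions made up for illustration, not the actual file from this thread.

```python
import xml.sax

class FilenameHandler(xml.sax.ContentHandler):
    """Collect the text of every <filename> element (hypothetical tag name)."""

    def __init__(self):
        super().__init__()
        self.filenames = []
        self._buffer = None  # list of text chunks while inside <filename>

    def startElement(self, name, attrs):
        if name == "filename":
            self._buffer = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "filename" and self._buffer is not None:
            self.filenames.append("".join(self._buffer))
            self._buffer = None

# Raw UTF-8 bytes, as they would arrive from the web server.
xml_bytes = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<files><filename>blåbær.txt</filename></files>"
).encode("utf-8")

handler = FilenameHandler()
xml.sax.parseString(xml_bytes, handler)
print(handler.filenames)
```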
BigDaddy



Joined: 26 May 2006
Posts: 147
The transition to unicode is a pain in the ass. Luckily I'm an American and can just ignore it.
Jackil



Joined: 24 May 2006
Posts: 97
BigDaddy, I'm really just looking for a way to store filenames with local characters on both Linux and Windows, and then be able to get the names back in a readable form.
But this whole character thing really confuses me...
BigDaddy



Joined: 26 May 2006
Posts: 147
Jackil, if the XML file is coming back as UTF-8, you need to be aware of that and read it in as UTF-8 (by reading the string and then getting a new unicode string from it with .decode('utf-8')).
Or verify that the XML parser knows UTF-8, or teach it (I doubt it does).
Jackil



Joined: 24 May 2006
Posts: 97
BigDaddy, OK, the filesystem on Linux is using UTF-8. But would it be easier to just let the web server enforce ISO?
BigDaddy



Joined: 26 May 2006
Posts: 147
The other issue is whether the other end is really putting UTF-8 into the XML file, or just the local filename in whatever encoding the filesystem (or whatever generated the XML) is using.
Because if the server is SAYING it's UTF-8 but it's actually Latin-1 in the fields, you are in trouble ;)
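
A small sketch of why that mismatch hurts, in Python 3 syntax. The filename `café.txt` and the helper `is_valid_utf8` are made up for illustration. Latin-1 bytes with accented characters typically fail to decode as UTF-8, while UTF-8 bytes decode as Latin-1 without any error and just come out garbled.

```python
def is_valid_utf8(raw):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

name = "café.txt"
really_utf8 = name.encode("utf-8")      # b'caf\xc3\xa9.txt'
really_latin1 = name.encode("latin-1")  # b'caf\xe9.txt'

print(is_valid_utf8(really_utf8))    # True
print(is_valid_utf8(really_latin1))  # False: 0xe9 starts a multi-byte sequence

# The reverse mistake raises no error, it silently produces garbage.
mojibake = really_utf8.decode("latin-1")
print(mojibake)  # cafÃ©.txt
```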
Jackil



Joined: 24 May 2006
Posts: 97
yeah, I got a lot of errors with filenames and some Unicode coerce stuff


BigDaddy



Joined: 26 May 2006
Posts: 147
1. Make sure the XML file is really UTF-8 by trying to decode it into Unicode with .decode('utf-8'). If this works, there are at least no UTF-8 errors.
2. Either convert the Unicode to ASCII with .encode('ascii', 'xmlcharrefreplace') to replace funny chars with &#NNNN; character references,
3. or run the Unicode through the XML parser and hope it works,
4. or convert it to nationalized 8-bit text with .encode('latin-1') or somesuch, then run it through, then convert back to UTF-8 later (maybe with .decode('latin-1').encode('utf-8')).
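
The conversions in the numbered options above can be sketched like this, in Python 3 syntax. The sample filename is made up, and option 3 is left out because it depends on the parser in use.

```python
# Pretend these bytes came out of the XML file (assumption: made-up sample).
raw = "blåbær.txt".encode("utf-8")

# 1. Check that the bytes really are UTF-8; this raises UnicodeDecodeError if not.
text = raw.decode("utf-8")

# 2. Pure-ASCII form, with non-ASCII characters as numeric character references.
ascii_form = text.encode("ascii", "xmlcharrefreplace")
print(ascii_form)  # b'bl&#229;b&#230;r.txt'

# 4. Detour through an 8-bit encoding, then convert back to UTF-8 later.
latin1_form = text.encode("latin-1")
back_to_utf8 = latin1_form.decode("latin-1").encode("utf-8")
print(back_to_utf8 == raw)  # True
```

The Latin-1 detour only works for characters that exist in Latin-1; anything else makes `.encode('latin-1')` raise `UnicodeEncodeError`.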
Jackil



Joined: 24 May 2006
Posts: 97
thank you :) I will surely investigate this.
BigDaddy



Joined: 26 May 2006
Posts: 147
http://www.reportlab.com/i18n/python_unicode_tutorial.html
http://www.jorendorff.com/articles/unicode/python.html
Jackil



Joined: 24 May 2006
Posts: 97
Yeah, I've been looking at those URLs lately. Just one more question: do you know of a reliable way to identify what encoding a file has?