Byte String Vs. Unicode String. Python


Answer :

No, Python does not use its own encoding. It will use any encoding that it has access to and that you specify. A character in a str represents one Unicode code point. However, Unicode defines far more than 256 code points, so Unicode encodings need more than one byte to represent many of them. bytes objects give you access to the underlying bytes. str objects have the encode method, which takes the name of an encoding and returns the bytes object that represents the string in that encoding. bytes objects (and their mutable counterpart, bytearray) have the decode method, which takes the name of an encoding and returns the str that results from interpreting the bytes as a string encoded in the given encoding. Here's an example.

>>> a = "αά".encode('utf-8')
>>> a
b'\xce\xb1\xce\xac'
>>> a.decode('utf-8')
'αά'

We can see that UTF-8 is using four bytes, \xce, \xb1, \xce, and \xac to represent two characters. After the Spolsky article that Ignacio Vazquez-Abrams referred to, I would read the Python Unicode Howto.
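To make the point about "any encoding that you specify" concrete, here is a small sketch that encodes the same two-character string with three standard codecs and compares the sizes. The variable names are only illustrative, and the choice of codecs is an assumption made for the demonstration.

```python
# Encode the same two-character string with several codecs and compare sizes.
text = "αά"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")   # little-endian, no byte-order mark
greek = text.encode("iso-8859-7")  # a legacy single-byte Greek code page

# encode() returns an immutable bytes object, not a bytearray.
print(type(utf8).__name__)
print(len(text), len(utf8), len(utf16), len(greek))

# bytearray is the mutable counterpart and decodes the same way.
mutable = bytearray(utf8)
round_trip = mutable.decode("utf-8")
```

The string always has two characters, but the number of bytes depends on the codec, which is why the encoding has to be named explicitly when converting in either direction.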


Here's an attempt at a simple explanation that only applies to Python 3. I hope that, coming from a layperson, it helps clear up some confusion for the completely uninitiated. If there are any technical inaccuracies, please forgive me and feel free to point them out.

Suppose you create a string using Python 3 in the usual way:

stringobject = 'ant' 

stringobject would be a Unicode string.

A Unicode string is made up of Unicode characters. In stringobject above, the Unicode characters are the individual letters, i.e. a, n and t.

Each Unicode character is assigned a code point, which can be expressed as a sequence of hex digits (a hex digit can take on 16 values, written 0-9 and A-F). For instance, the letter 'a' is equivalent to '\u0061', and 'ant' is equivalent to '\u0061\u006E\u0074'.

So you will find that if you type in,

stringobject = '\u0061\u006E\u0074'
stringobject

You will also get the output 'ant'.
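The correspondence between characters and code points can also be checked with the built-ins ord and chr. This is a minimal sketch; the variable names are illustrative only.

```python
# Map each character to its code point and back again.
word = "ant"

code_points = [ord(ch) for ch in word]            # integers
hex_points = [f"U+{cp:04X}" for cp in code_points]

print(code_points)  # [97, 110, 116]
print(hex_points)   # ['U+0061', 'U+006E', 'U+0074']

# chr() is the inverse of ord(), so the string can be rebuilt.
rebuilt = "".join(chr(cp) for cp in code_points)
same_as_escapes = rebuilt == "\u0061\u006E\u0074"
```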

Now, unicode is converted to bytes, in a process known as encoding. The reverse process of converting bytes to unicode is known as decoding.

How is this done? Since each hex digit can take on 16 different values, it can be reflected in a 4-bit binary sequence (e.g. the hex digit 0 can be expressed in binary as 0000, the hex digit 1 can be expressed as 0001 and so forth). If a Unicode character has a code point consisting of four hex digits, writing that code point out directly would take a 16-bit binary sequence. Real encodings do not necessarily store the code point this literally.
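The hex-digit-to-bits arithmetic can be sketched as follows. This only writes the code point out in binary; it is not what any particular encoding stores, and the variable names are made up for the illustration.

```python
# Write the code point of 'a' as four hex digits and as 16 bits.
cp = ord("a")

hex_digits = f"{cp:04X}"   # '0061'
bits = f"{cp:016b}"        # '0000000001100001'

# Each hex digit corresponds to one group of 4 bits.
groups = [f"{int(digit, 16):04b}" for digit in hex_digits]

print(hex_digits, bits, groups)
```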

Different encoding systems specify different rules for converting unicode to bits. Most importantly, encodings differ in the number of bits they use to express each unicode character.

For instance, the ASCII encoding uses a single byte per character, and only 7 of its 8 bits. Thus it can only encode the 128 Unicode characters with code points from U+0000 to U+007F. The UTF-8 encoding uses 8 to 32 bits (1 to 4 bytes) per character, so it can encode every Unicode character, up to the highest code point, U+10FFFF.
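As a rough illustration of the variable width of UTF-8, the sketch below encodes one sample character from each length class and records how many bytes it takes. The sample characters and names are chosen only for the demonstration.

```python
# One sample character per UTF-8 length class: a, é, €, and an emoji.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

utf8_lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    label = f"U+{ord(ch):04X}"
    utf8_lengths[label] = len(encoded)
    print(f"{label} -> {len(encoded)} byte(s): {encoded.hex()}")
```

The higher the code point, the more bytes UTF-8 needs, from 1 byte for 'a' up to 4 bytes for the emoji.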

Running the following code:

byteobject = stringobject.encode('utf-8')
byteobject, type(byteobject)

converts a Unicode string into a byte string using the UTF-8 encoding, and returns (b'ant', <class 'bytes'>).

Note that if you used 'ASCII' as the encoding, you wouldn't run into any problems, since all code points in 'ant' are below U+0080 and so fit in ASCII. But if you had a Unicode string containing characters with code points above U+007F, you would get a UnicodeEncodeError.
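The failure case can be reproduced with a short sketch like the one below. The sample string is an assumption chosen for illustration; the errors argument of str.encode is part of standard Python and shows two ways to avoid the exception at the cost of losing or escaping the characters.

```python
# Encoding non-ASCII text with the ASCII codec raises UnicodeEncodeError.
text = "αά"

failed = False
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    failed = True
    print(exc.reason)  # ordinal not in range(128)

# The errors argument changes what happens to unencodable characters.
replaced = text.encode("ascii", errors="replace")          # b'??'
escaped = text.encode("ascii", errors="backslashreplace")  # b'\\u03b1\\u03ac'

# Pure-ASCII text encodes without any problem.
plain = "ant".encode("ascii")
```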

Similarly,

stringobject = byteobject.decode('utf-8')
stringobject, type(stringobject)

gives you ('ant', <class 'str'>).
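Decoding only gives back the original text if you name the encoding that was used to produce the bytes. The sketch below, with illustrative names and a deliberately wrong codec, shows the difference.

```python
# Bytes produced with UTF-8 must be decoded with UTF-8 to round-trip.
original = "αά"
data = original.encode("utf-8")

right = data.decode("utf-8")
wrong = data.decode("latin-1")   # no error, but every byte becomes one character

print(right == original)        # True
print(len(right), len(wrong))   # 2 4

# A stricter codec refuses the bytes instead of silently misreading them.
try:
    data.decode("ascii")
    rejected = False
except UnicodeDecodeError:
    rejected = True
```

Some wrong codecs fail loudly and others quietly return the wrong characters, so it is worth keeping track of which encoding a byte string came from.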

