CHARACTER CODE
Each computer has a set of characters that it uses. As a bare minimum, this set includes the 26 uppercase letters, the 26 lowercase letters, the digits 0 through 9, and a set of special symbols, such as space, period, minus sign, comma, and carriage return.
In order to transfer these characters into the computer, each one is assigned a number: for example, a=1, b=2, ..., z=26, +=27, -=28. The mapping of characters onto integers is called a character code.
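The toy mapping above (a=1, ..., z=26, +=27, -=28) can be sketched in Python with two lookup tables. This is only an illustration of what a character code is; the names `TOY_CODE`, `TOY_DECODE`, `encode` and `decode` are made up for this example and do not belong to any real standard.

```python
import string

# Toy character code from the text: a=1, b=2, ..., z=26, +=27, -=28.
TOY_CODE = {ch: i for i, ch in enumerate(string.ascii_lowercase, start=1)}
TOY_CODE["+"] = 27
TOY_CODE["-"] = 28

# Reverse table for turning numbers back into characters.
TOY_DECODE = {number: ch for ch, number in TOY_CODE.items()}


def encode(text):
    """Map each character to its number in the toy code."""
    return [TOY_CODE[ch] for ch in text]


def decode(numbers):
    """Map each number back to its character."""
    return "".join(TOY_DECODE[n] for n in numbers)


print(encode("cab"))        # [3, 1, 2]
print(decode([1, 27, 2]))   # a+b
```

Any character missing from the table (for example an uppercase letter) raises a `KeyError`, which shows why a real character code has to cover every character the computer uses.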
THREE IMPORTANT CHARACTER CODES
- ASCII
- UNICODE
- UTF-8
ASCII (American Standard Code for Information Interchange)
ASCII has:
- 7 bits per character
- 128 characters in total
Figure: standard ASCII table
The ASCII printing characters are straightforward. They include the uppercase and lowercase letters, digits, punctuation marks, and a few math symbols.
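As a small sketch of the 7-bit limit, Python's built-in `ord()` and `chr()` convert between a character and its code. The helper `is_ascii` is a name made up for this example.

```python
# ord() gives the code of a character, chr() goes the other way.
print(ord("A"), ord("a"), ord("0"))   # 65 97 48
print(chr(65))                        # A

# Every ASCII code fits in 7 bits, so there are 2**7 = 128 of them.
print(2 ** 7)                         # 128


def is_ascii(text):
    """True if every character has a code below 128."""
    return all(ord(ch) < 128 for ch in text)


print(is_ascii("Hello, world"))       # True
print(is_ascii("syst\u00e8me"))       # False, because of the accented e
```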
Unicode
Figure: an example of a Unicode table
ASCII is fine for English but less fine for other languages. French needs some accents (e.g. système); German needs diacritical marks (e.g. für), and so on. Some European languages have a few letters not found in ASCII, such as the German ß and the Danish ø. Some languages have entirely different alphabets (e.g. Russian and Arabic), and a few languages have no alphabet at all (e.g. Chinese). As computers spread to the four corners of the globe and software vendors wanted to sell products in countries where most users do not speak English, a different character set was needed.
1st attempt - extending ASCII with IS 646, which added another 128 characters to ASCII
- making it an 8-bit code called Latin-1. The additional characters were mostly Latin letters with accents and diacritical marks.
2nd attempt - IS 8859, which introduced the concept of a code page
- a set of 256 characters for a particular language or group of languages
- IS 8859-1 is Latin-1
- IS 8859-2 handles the Latin-based Slavic languages (e.g. Czech, Polish and Hungarian).
- IS 8859-3 contains the characters needed for Turkish, Maltese, Esperanto, Galician and so on.
The trouble with the code-page approach is that the software has to keep track of which page it is currently on, it is impossible to mix languages across pages, and the scheme does not cover Japanese and Chinese at all.
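The code-page problem can be seen with one byte. In this sketch the byte value 0xE6 is an arbitrary choice for illustration; Python's `latin-1` and `iso8859-2` codecs stand in for IS 8859-1 and IS 8859-2.

```python
raw = bytes([0xE6])  # one byte, with no note saying which page it belongs to

as_page_1 = raw.decode("latin-1")     # read as IS 8859-1 (Latin-1)
as_page_2 = raw.decode("iso8859-2")   # read as IS 8859-2

print(as_page_1)               # æ
print(as_page_2)               # ć
print(as_page_1 == as_page_2)  # False: same byte, different character
```

The byte alone does not say which character is meant, so the software has to remember the page separately.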
A group of computer companies decided to solve this problem by forming a consortium to create a new system, called Unicode, and getting it proclaimed an International Standard (IS 10646). Unicode is now supported by programming languages (e.g. Java), operating systems (e.g. Windows), and many applications.
The idea behind Unicode is to assign every character and symbol a unique 16-bit value, called a code point.
No multibyte characters or escape sequences are used. Having every symbol be 16 bits makes writing software simpler.
UNICODE
- 16-bit symbols
- 65,536 code points
- the world's languages collectively use about 200,000 symbols, so code points are a scarce resource that must be allocated with great care
- to speed the acceptance of Unicode, the consortium used Latin-1 as code points 0 to 255, making conversion between ASCII and Unicode easy
- code points are divided into blocks, each one a multiple of 16 code points
- to allow users to invent special characters for special purposes, 6400 code points have been allocated for local use
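Some of the numbers in this list can be checked with a few lines of Python. This is a rough sketch: the range U+E000 to U+F8FF for the local-use block is not stated in the notes above and is filled in here as an assumption (it is the Private Use Area of today's Unicode).

```python
import unicodedata

# 16 bits give 2**16 distinct code points.
print(2 ** 16)  # 65536

# Code points 0 to 255 match Latin-1, so converting a Latin-1 byte
# to a code point keeps the same number.
for byte_value in (0x41, 0xE9, 0xFC):
    ch = bytes([byte_value]).decode("latin-1")
    print(byte_value, ord(ch), ch)

# Local-use block, assumed here to be U+E000..U+F8FF.
private_use = range(0xE000, 0xF8FF + 1)
print(len(private_use))                    # 6400
print(len(private_use) % 16)               # 0, a multiple of 16
print(unicodedata.category(chr(0xE000)))   # Co (private use)
```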
UTF-8
Figure: UTF-8 data
Although better than ASCII, Unicode eventually ran out of code points, and it also requires 16 bits per character to represent pure ASCII text, which is wasteful. Consequently, another coding scheme was developed to address these concerns. It is called UTF-8, short for UCS Transformation Format (8-bit), where UCS stands for Universal Character Set, which is essentially Unicode. UTF-8 codes are variable length, from 1 to 4 bytes. The original design allowed longer sequences and could code about two billion characters; the 4-byte limit used today still covers more than a million code points. It is the dominant character encoding used on the World Wide Web.
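The variable length is easy to observe with Python's built-in `str.encode`. The four sample characters below are arbitrary picks for illustration, and `utf-16-be` is used as a stand-in for the 16-bit-per-character encoding described above.

```python
# One sample character for each UTF-8 length (arbitrary choices).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} takes {len(utf8)} byte(s) in UTF-8")

# Pure ASCII text: 1 byte per character in UTF-8, 2 bytes with 16-bit codes.
text = "plain ASCII text"
print(len(text.encode("utf-8")))      # 16
print(len(text.encode("utf-16-be")))  # 32
```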
What makes UTF-8 nice is:
- codes 0 to 127 are the ASCII characters
- allowing them to be expressed in 1 byte (versus 2 bytes in 16-bit Unicode)
- if a program or document uses only characters that are in the ASCII character set, each can be represented in 8 bits
- the first byte of every UTF-8 character uniquely determines the number of bytes in the character
- continuation bytes in a UTF-8 character always start with 10
- making the code self-synchronizing
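The last three points can be sketched as follows. The function names `sequence_length`, `is_continuation` and `next_character_start` are made up for this example, and the code only looks at the leading bits; it is not a full UTF-8 validator.

```python
def sequence_length(first_byte):
    """Number of bytes in a UTF-8 character, judged from its first byte."""
    if first_byte >> 7 == 0b0:
        return 1          # 0xxxxxxx: an ASCII character
    if first_byte >> 5 == 0b110:
        return 2          # 110xxxxx
    if first_byte >> 4 == 0b1110:
        return 3          # 1110xxxx
    if first_byte >> 3 == 0b11110:
        return 4          # 11110xxx
    raise ValueError("not a valid first byte")


def is_continuation(byte):
    """Continuation bytes always look like 10xxxxxx."""
    return byte >> 6 == 0b10


def next_character_start(data, position):
    """Skip continuation bytes until a character boundary is found."""
    while position < len(data) and is_continuation(data[position]):
        position += 1
    return position


data = "a\u00e9\u20ac".encode("utf-8")   # 1 + 2 + 3 = 6 bytes
print([f"{b:08b}" for b in data])

# First bytes sit at positions 0, 1 and 3.
print(sequence_length(data[0]), sequence_length(data[1]), sequence_length(data[3]))  # 1 2 3

# Landing in the middle of a character (position 2) is recoverable.
print(next_character_start(data, 2))  # 3
```

Because a continuation byte can never be mistaken for a first byte, a reader that starts in the middle of the data only has to skip forward to the next byte that does not start with 10.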
I believe that if you show people the problems and you show them the solutions they will be moved to act.
- Bill Gates
Posted by Nur Aqeelah Napis (B031310448)