These days, that's less than half of the global online population. Yet most of the world's websites -- and nearly all e-commerce sites -- are in English and other languages based on Roman characters.
"The Internet has always been controlled by Western people," said Byung-Kyu Kim, director of address management at the Korea Network Information Center, expressing the frustration of would-be surfers who log on to encounter what amounts to a giant English-only zone.
And even as companies strive to go multilingual -- and extend their brands to the global buying public -- they're learning it isn't as easy as putting the local character set on a user's screen. There's a lot of back-end work that has to be done -- namely in writing code that's as usable in Korea as it is in Kazakhstan. It involves building domains, creating HTML commands, and setting up e-commerce apps -- all in languages not based on the Roman alphabet, some of which (like Japanese) have up to 6,000 characters.
Enter Unicode. The emerging standard for representing international character sets surpasses ASCII and its paltry 128-character limit (256 in the common 8-bit extensions): It can handle up to 65,000 characters, enough to map not only Japanese but "theoretically all known alphabet schemes and still have room left over for expansion," according to The Unicode Consortium, Mountain View, Calif. It also works with the current lingua franca of Web design, HTML, and its successor, XML.
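To make the size difference concrete, here is a minimal Python sketch. It is an illustration only, not anything published by the Consortium, and the sample characters and the `describe` helper are assumptions chosen for the example. ASCII proper defines 128 characters and a single 8-bit byte allows 256 values, while a 16-bit unit allows 65,536. The sketch reports which of those ranges each character's code point falls into.

```python
# Sample characters chosen for illustration (not taken from the article):
# Latin A, e-acute, Cyrillic Zhe, Arabic beh, Hangul "han", Han "sun/day".
SAMPLES = ["A", "\u00e9", "\u0416", "\u0628", "\ud55c", "\u65e5"]


def describe(ch):
    """Return (code point, fits 7-bit ASCII, fits one byte, fits 16 bits)."""
    cp = ord(ch)
    return cp, cp < 128, cp < 256, cp < 65536


for ch in SAMPLES:
    cp, in_ascii, in_byte, in_16bit = describe(ch)
    print(f"U+{cp:04X} ascii={in_ascii} one_byte={in_byte} 16bit={in_16bit}")
```

Only the first sample fits in ASCII and only the first two fit in a single byte, but all six fit in 16 bits.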
Unicode is coming along none too soon, if statistics, trends -- and the frustration of non-English-speaking users -- are any indication.
"Already, users who speak English as their primary language constitute only a little more than half of all persons using the Net," said Bill Myers, CEO of the United States Internet Council, Washington.
Then there's the e-commerce angle. Forrester Research says that by 2004, half of all online commerce will take place outside the U.S., making globalization a necessity rather than a sideline. In markets where European languages are not widely read, lacking a local-language website will mean losing sales.
And more and more people are going online every day.
The combination of high PC ownership and higher levels of adult literacy (in East Asia, it's nearly total) is one element driving global Internet growth. So is increasing use of mobile phones and other wireless devices with Internet access, according to Sergey Brin, president and co-founder of search engine service Google.com, which recently introduced Asian language features. Some mobile providers, such as NTT DoCoMo, Tokyo, already use multiple character sets.
But Web development in non-European languages isn't easy. Like all major programming languages, the HTML markup language is English-based.
And users in some parts of the world are acutely aware of it.
"The confusion in the character set scene and the complexity of displaying Arabic text have restricted the growth of Web use in Arabic," said Badr H. al-Badr of King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia.
"The problems are not related to the features of Arabic text," he said. "Rather, they are by-products of Internet protocols originating in the Western world."
How many websites, and how many users, would benefit from improved multi-language technology? Domain-name registrations offer some indication.
According to the Internet Software Consortium, Redwood City, Calif., there are about 5.4 million hosts in seven major countries -- China, Hong Kong, Israel, Japan, Russia, South Korea, and Taiwan -- using non-Roman characters. That contrasts with about 65 million using the major U.S. top-level domains such as .com, which is also used by many non-U.S. sites.
And Network Solutions Inc., through its Dotcom.com research arm, expects that by 2002, more domains will be registered each month outside the U.S. than inside it, with South Korea leading the charge.
The number of people speaking each language also should be factored in. In terms of numbers, Chinese is in the lead -- with 885 million speakers, according to Ethnologue, Dallas. There are half as many speakers of English, which is second on the list.
Sheer numbers are only part of the story, though: There may be only 126 million speakers of Japanese, for example, but high penetration of Internet-enabled devices makes Japan a coveted e-commerce market.
The crucial issue for developers, then, is cracking markets where literacy and PC and wireless penetration are high, said Brian O'Shaughnessy, director of policy communications at Network Solutions, Herndon, Va. That would make China, Japan, Russia, South Korea, and Thailand key targets, based on GDP figures from Keynote Publishing Co.'s PoliSci.com and on adult-literacy stats from the International Federation of Library Associations and Institutions.
But developing systems for non-Roman scripts is tough. Classical Chinese, like Japanese, uses thousands of characters. In Arabic, explained al-Badr, "the shapes of characters depend on their position in the word." Like Hebrew, the language is usually written right to left, while quotations from English text run left to right. Other Asian languages sometimes run top to bottom.
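Both properties al-Badr describes show up in Unicode's own character data. The following is a minimal sketch using Python's standard `unicodedata` module. The choice of the letter beh as the example and the helper names are assumptions made for illustration. It lists the four legacy "presentation form" code points for one Arabic letter, one per position in a word, and looks up the writing direction assigned to Latin, Hebrew, and Arabic letters. In practice the positional shapes are normally chosen by the rendering software and font, not typed as separate code points.

```python
import unicodedata

# The Arabic letter beh and its four positional presentation forms
# (isolated, final, initial, medial). Chosen here only as an example.
BEH = "\u0628"
BEH_FORMS = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]


def positional_names(forms):
    """Return the official Unicode name of each positional form."""
    return [unicodedata.name(ch) for ch in forms]


def base_letter(form):
    """Fold a positional form back to the underlying letter (NFKC)."""
    return unicodedata.normalize("NFKC", form)


def direction(ch):
    """Bidirectional class: 'L' is left-to-right; 'R' and 'AL' are right-to-left."""
    return unicodedata.bidirectional(ch)


for name in positional_names(BEH_FORMS):
    print(name)

# Latin A, Hebrew alef, Arabic beh
print([direction(ch) for ch in ("A", "\u05d0", BEH)])
```

All four forms fold back to the same letter, which is why text can be stored once and shaped at display time.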
Before Unicode, developers and website owners attacked the problem in a number of ways. For example, the Multilingual-HTML Browser Project at the University of Library and Information Science, Ibaraki, Japan, developed an offshoot of HTML called MHTML, in which the coding for each Web page contained the images necessary to build up the character set used on that page.
Some sites relied on image-based alternatives, such as converting pages to images or sending each character as a separate picture. Others used Adobe's Acrobat document format to publish in non-European languages, but that stretched download times and required Acrobat Reader software. There were even browsers specific to international character sets, but none achieved wide acceptance.
But Unicode is fundamentally different from these approaches in that it works on a principle similar to ASCII. There's no radical change in the way Web pages are created, served, and interpreted by browser software. Because it works with HTML, Web designers can specify any of some 35 international scripts to be used in their pages, according to the World Wide Web Consortium at MIT, Cambridge, Mass. These include the Han or Kanji ideographs used for Chinese, Japanese, and Korean, as well as Arabic, modern Greek, Cyrillic, Hebrew, Thai, Cherokee, Khmer -- and even Roman.
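One way a page can carry characters from any script while the file itself stays plain ASCII is the HTML numeric character reference, which names a character by its Unicode code point. The sketch below is an illustration under stated assumptions, not a description of how any particular site was built. The `to_char_refs` helper and the sample strings are hypothetical, and the conversion relies on Python's built-in `xmlcharrefreplace` error handler.

```python
import html


def to_char_refs(text):
    """Encode text as ASCII, replacing each non-ASCII character with an
    HTML numeric character reference such as &#26085;."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Sample text chosen for illustration: Japanese, Russian, and Greek letters.
sample = "\u65e5\u672c / \u0420\u0443\u0441 / \u03b1"
encoded = to_char_refs(sample)
print(encoded)

# A browser, or html.unescape here, turns the references back into characters.
assert html.unescape(encoded) == sample
```

Plain ASCII text passes through unchanged, so existing English-language markup is unaffected by the conversion.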
With a browser supporting Unicode and HTML 4.01, all a user would need to display a site in a non-Roman alphabet is the appropriate font set. Unicode works with versions of Microsoft Internet Explorer and Netscape Navigator later than 4.0, and Microsoft's Global Input Method Editor allows users of Explorer to type Chinese, Japanese, or Korean text into Web forms and e-mails -- even if their PC is not set up to use those character sets normally.
But reading pages is only half of the story. If the Web is to become truly global, users must be able to search -- and operators have to be able to register domain names -- in multiple character sets.
Coming Wednesday in part two of Building The Tower Of Babel: What website operators are doing to help users search the Web in their own languages, and the changes that are in store for DNS.