Installing the Language Identification Module (LIM) web application on Windows

The LIM web application runs inside the Apache Tomcat Servlet/JSP container. The requirements for LIM on the Windows platform are as follows.

- Java 2 Standard Edition Runtime Environment (JRE) version 5.0 or later.
- Apache Tomcat 5.0 or above
- Apache Ant 1.6.5 or above

Installing LIM for Windows

Unpack the LIM distribution to c:\lim

Running With JDK 5.0 Or Later

Note: Parts of the following guide are copied from Tomcat's installation guide.

As Java is the core engine, make sure it is installed.

Install the JDK according to the instructions included with the release.

Set an environment variable named JAVA_HOME to the pathname of the directory into which you installed the JDK, e.g. c:\j2sdk5.0 or /usr/local/java/j2sdk5.0.

Download and Install Apache Ant

Download the latest Ant binary distribution from
Unpack it, then set an environment variable named ANT_HOME to the directory where Ant was installed, e.g. C:\apache-ant-1.7.0 or /usr/local/apache-ant-1.7.0

Download and Install the Tomcat Binary Distribution

Download a binary distribution of Tomcat from

Unpack the binary distribution into a convenient location so that the distribution resides in its own directory (conventionally named "apache-tomcat-[version]").

Set an environment variable named CATALINA_HOME to the pathname of the directory into which you installed Tomcat, e.g. C:\apache-tomcat-6.0.13 or /usr/local/apache-tomcat-6.0.13.

Install LIM web application to Tomcat

There are two groups of files to install.

(1) LIM depends on many third-party libraries. Copy them to Tomcat's library directory, e.g. copy c:\lim\lib\* %CATALINA_HOME%\lib\

List of libraries:

(2) Copy the LIM web application, i.e. trainer.war, to Tomcat's webapps directory, e.g. copy c:\lim\trainer.war %CATALINA_HOME%\webapps\

Start LIM database

Make sure the JRE and Ant binaries are in the current search path. If they are not, use the following command to add them: set PATH=%PATH%;%JAVA_HOME%\bin;%ANT_HOME%\bin
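The search-path check described above can also be scripted. The following is only a minimal sketch in Python, assuming the executables are named "java" and "ant"; the helper name missing_tools is made up for this illustration and is not part of LIM.

```python
import shutil


def missing_tools(names, which=shutil.which):
    """Return the tool names that cannot be found on the current PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    # "java" and "ant" are assumed to be the executable names on PATH.
    missing = missing_tools(["java", "ant"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("java and ant are both on PATH")
```

If either name is reported as missing, the set PATH command above should fix it for the current command prompt session.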

Start the database for LIM by typing the following at the command prompt:
cd c:\lim
db dbstart

Startup Tomcat

Note: If you encounter problems starting the LIM web application in Tomcat, try increasing its minimum memory pool (to at least 128 MB) and its maximum memory pool (to at least 256 MB).

The following command will increase the memory:
set CATALINA_OPTS=-Xms128m -Xmx256m

Tomcat can then be started by executing one of the following commands:
%CATALINA_HOME%\bin\startup.bat (for Windows)
or $CATALINA_HOME/bin/ (for Unix/Linux)

Now you can access LIM by pointing your browser to http://localhost:8080/trainer

How to define "Endangered Language"

Conventional Definition

"Atlas of the World's Languages in Danger of Disappearing"[1] defines five levels of language endangerment.

- Potentially endangered language: decreasing numbers of children learn the language
- Endangered language: the youngest speakers are young adults
- Seriously endangered language: the youngest speakers have reached or passed middle age
- Moribund language: only a few elderly speakers are left
- Extinct language: no speakers are left

As shown above, "aging of speakers" is essentially the single criterion of endangerment. On another page of the publication, we find the following definition: "What exactly does it mean when a language is referred to as being 'endangered'? Basically, the language of any community that is no longer learned by children, or at least by a larger part of the children of that community (say, at least 30 per cent), should be regarded as 'endangered' or at least 'potentially endangered'."[2]

So when the number of children who learn a language declines below 30 per cent of their generation, the language enters the list of endangered languages; then, as its speakers age, it climbs the ladder of endangerment.

Wikipedia uses other criteria for endangered languages.[4] It lists the following three criteria, which are quite similar to those mentioned above.
1. The number of speakers currently living.
2. The mean age of native and/or fluent speakers.
3. The percentage of the youngest generation acquiring fluency with the language in question.

Next Question

Here, I would like to ask a question: how, then, should we define the level of endangerment of a language on the Internet? A language may disappear from cyberspace even while many speakers keep using it as a means of communication in daily life. The aging of speakers is of course an important factor, but the disappearance of a language from the Internet happens long before the last speaker dies or ceases to use it. In reality, we have to admit that many of the world's languages have not even been born yet on the Internet!

The criteria for the endangerment of a language on the Internet involve more than the existence of speakers and their aging. They should cover a much wider range of factors and phenomena, such as the availability of written documents on the Internet, official use of a language by e-government services, use of a language as a medium of education and knowledge creation on the Internet, and the various technical tools that enable users to take advantage of the pool of electronic knowledge written in a language.

My Proposal

Considering all these factors, I am proposing several criteria. The following tentative list shows a few of them.

  1. Number of web pages written in a language
  2. Number of web pages divided by the number of speakers of a language
  3. Availability of a language at the government site
  4. Availability of a language at the university and/or other educational institutions' site
  5. Availability of online newspapers/magazines written in a language
  6. Availability of search engine(s) for pages written in a language
  7. Availability of online radio in a language
  8. Availability of chat rooms in a language
  9. Availability of globally standardized charset for the script used to write a language
  10. Frequency of updating of web pages written in a language

In principle, all these data can be drawn from the Language Observatory's crawled and analyzed database. I am waiting for your comments.
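
As an illustration of criterion 2 in the list above (number of web pages divided by the number of speakers), here is a minimal Python sketch. The function name and all figures are hypothetical and chosen only for illustration; they are not Language Observatory data.

```python
def pages_per_speaker(page_count, speaker_count):
    """Criterion 2: web pages written in a language, per speaker."""
    if speaker_count <= 0:
        raise ValueError("speaker_count must be positive")
    return page_count / speaker_count


# Hypothetical (page count, speaker count) pairs, for illustration only.
samples = {
    "language A": (2_000_000, 40_000_000),
    "language B": (1_500, 3_000_000),
}

# List the languages from the lowest ratio (least present online) upwards.
for name, (pages, speakers) in sorted(
    samples.items(), key=lambda item: pages_per_speaker(*item[1])
):
    print(f"{name}: {pages_per_speaker(pages, speakers):.6f} pages per speaker")
```

A ratio like this lets a language with few pages but very few speakers be compared fairly with a widely spoken language that is barely present online.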

[1] "Atlas of the World's Languages in Danger of Disappearing: Second edition, revised, enlarged and updated", edited by Stephen A. Wurm, cartography by Ian Heyward, UNESCO Publishing, 2001, ISBN: 92-3-103255-0.
[2] ibid. p.14
[3] Interactive Atlas of the World's Languages in Danger of Disappearing online
[4] endangered language at Wikipedia

Turkish Language Tutorial

This article was written by Dr. Ahmed Tarcan, a linguist at Dicle Üniversitesi, Diyarbakir, Turkey. He visited our laboratory in February-March 2006.

Turkish Language Family

Turkish belongs to the Altay branch of the Ural-Altaic linguistic family, the same family as Finnish and Hungarian. It is the westernmost of the Turkic languages spoken across Central Asia and is generally classified as a member of the South-West group, also known as the Oguz group. Other Turkic languages, all of which are closely related, include Azerbaijani (Azeri), Kazakh, Kyrgyz, Tatar, Turkmen, Uighur, Uzbek, and many others spoken from the Balkans across Central Asia into northwestern China and southern Siberia. Turkic languages are often grouped with Mongolian and Tungusic languages in the Altaic language family. Strictly speaking, the "Turkish" languages spoken between Mongolia and Turkey should be called Turkic languages, and the term "Turkish" should refer to the language spoken in Turkey alone. It is common practice, however, to refer to all these languages as Turkish and to differentiate them by geographical area, for example, the Turkish language of Azerbaijan.


Turkish World Map

Over the course of history, Turks have spread over a wide geographical area, taking their language with them. Turkish-speaking people have lived in a wide area stretching from today's Mongolia to the north coast of the Black Sea, the Balkans, Eastern Europe, Anatolia, Iraq and a wide area of northern Africa. Due to the distances involved, various dialects and accents have emerged. Turkish is also the language spoken at home by people who live in the areas that were governed by the Ottoman Empire. For instance, in Bulgaria there are over a million speakers. About 50,000 Turkish speakers live in Uzbekistan, Kazakhstan, Kyrgyzstan, Tajikistan, and Azerbaijan. In Cyprus, Turkish is a co-official language (with Greek), spoken as a first language by 19 percent of the population, especially in the North (KKTC). Over 1.5 million speakers are found in Bulgaria, Macedonia, and Greece; over 2.5 million speakers live in Germany (and other northern European countries), where Turks have for many years been "guest workers." About 40,000 Turkish speakers live in the United States.


Turkish has several dialects. The Turkish dialects can be divided into two major groups: Western dialects and Eastern dialects. Of the major Turkish dialects, Danubian appears to be the only member of the Western group. The following dialects make up the Eastern group: Eskisehir, Razgrad, Dinler, Rumelian, Karamanli, Edirne, Gaziantep, and Urfa. There are some other classifications that distinguish the following dialect groups: South-western, Central Anatolia, Eastern, Rumelian, and Kastamonu dialects. Modern standard Turkish is based on the Istanbul dialect of Anatolian.


The history of the language is divided into three main periods: Old Turkish (from the 7th to the 13th centuries), Middle Turkish (from the 13th to the 20th) and New Turkish (from the 20th century onwards). During the Ottoman Empire period, Arabic and Persian words invaded the Turkish language, which consequently became a mixture of three different languages. During the Ottoman period, which spanned five centuries, the natural development of Turkish was severely hampered. Turkish formed the basis for Ottoman Turkish, the written language of the Ottoman Empire. Ottoman Turkish was basically Turkish in structure, but with a heavy overlay of Arabic and Persian vocabulary and occasional grammatical influence. Ottoman Turkish co-existed with spoken Turkish, with the latter being considered a "gutter language" and not worthy of study. Ottoman Turkish and the spoken language were both written in the Arabic script.

Then came the "new language" movement started by Kemal Atatürk. In 1928, five years after the proclamation of the Republic, the Arabic alphabet was replaced by the Latin one, which in turn sped up the movement to rid the language of foreign words. Prior to the reform that introduced the Roman script, Turkish was written in the Arabic script. Up to the fifteenth century the Anatolian Turks used the Uighur script to write Turkish. The Turkish Language Institute (Türk Dil Kurumu) was established in 1932 to carry out linguistic research and contribute to the natural development of the language. As a consequence of these efforts, modern Turkish is a literary and cultural language developing naturally and free of foreign influences. Today literacy rates in Turkey are over 90%.


Like all of the Turkic languages, Turkish is agglutinative, that is, grammatical functions are indicated by adding various suffixes to stems. Separate suffixes on nouns indicate both case and number, but there is no grammatical gender. Nouns are declined in three declensions with six case endings: nominative, genitive, dative, accusative, locative, and ablative; number is marked by a plural suffix. Verbs agree with their subjects in person and number, and, as in nouns, separate identifiable suffixes perform these functions. The order of elements in a verb form is: verb stem + tense-aspect marker + subject affix. There is no definite article; the number "one" may be used as an indefinite article.

Subject-Object-Verb word order in Turkish is a typical Turkic characteristic, but other orders are possible in certain discourse situations. As an SOV language, in which objects precede the verb, Turkish has postpositions rather than prepositions, and relative clauses that precede the noun they modify.


Turkish has 8 vowels, and 21 consonants. It also has Turkic vowel harmony in which the vowels of suffixes must harmonize with the vowels of noun and verb stems; thus, for example, if the stem has a round vowel then the vowel of the suffix must be round, and so on. Stress on words pronounced in isolation is on the final syllable, but in discourse, stress assignment is complicated especially in the verb.
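
To make vowel harmony concrete, here is a minimal Python sketch that picks the plural suffix for a noun stem. It rests on assumptions not stated in the text above: it uses the plural suffix (-lar after a back vowel, -ler after a front vowel), which shows the front/back side of harmony rather than the rounding example, it expects lowercase input, and it ignores loanwords that break the pattern. The function name is made up for this illustration.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")


def plural(stem):
    """Attach -lar or -ler according to the last vowel of a lowercase stem."""
    for ch in reversed(stem):
        if ch in BACK_VOWELS:
            return stem + "lar"
        if ch in FRONT_VOWELS:
            return stem + "ler"
    raise ValueError(f"no vowel found in stem: {stem!r}")


for word in ["ev", "kitap", "göz", "kız", "okul"]:
    print(word, "->", plural(word))
```

Only the last vowel of the stem matters here, because each suffix harmonizes with the vowel immediately before it.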


In the Turkish of Turkey, the letters Q, X and W are not used, but in the Turkish of the Tatars these characters are used as well.

ANNEX: Scripts, Region and Speaking Population of Turkish Languages

Language name        SIL code  Roman/Arabic/Cyrillic  Region       Population
Azerbaijani, South   azb       ***                    Iran         24,364,000
Azerbaijani, North   azj       **                     Azerbaijan   7,059,529
Uzbek, North         uzn       ***                    Uzbekistan   18,795,591
Uzbek, South         uzs       *                      Afghanistan  1,454,981

Note: Ethnologue lists 40 individual languages under its Turkish language classifications. Only some of them are listed above.
source: Ethnologue, 14th Edition

UTF-8 conversion

char.  U+nnnn  Scalar Value       UTF-8
                                  1st Byte  2nd Byte  3rd Byte  in HEX
               00000000 0xxxxxxx  0xxxxxxx
SPACE  U+0020  00000000 00100000  00100000                      20
A      U+0041  00000000 01000001  01000001                      41
DEL    U+007F  00000000 01111111  01111111                      7F
               00000yyy yyxxxxxx  110yyyyy  10xxxxxx
NBSP   U+00A0  00000000 10100000  11000010  10100000            C2 A0
Ö      U+00D6  00000000 11010110  11000011  10010110            C3 96
Ю      U+042E  00000100 00101110  11010000  10101110            D0 AE
Ա      U+0531  00000101 00110001  11010100  10110001            D4 B1
ث      U+062B  00000110 00101011  11011000  10101011            D8 AB
               zzzzyyyy yyxxxxxx  1110zzzz  10yyyyyy  10xxxxxx
अ      U+0905  00001001 00000101  11100000  10100100  10000101  E0 A4 85
অ      U+0985  00001001 10000101  11100000  10100110  10000101  E0 A6 85
ਅ      U+0A05  00001010 00000101  11100000  10101000  10000101  E0 A8 85
அ      U+0B85  00001011 10000101  11100000  10101110  10000101  E0 AE 85
ಅ      U+0C85  00001100 10000101  11100000  10110010  10000101  E0 B2 85
ZWNJ   U+200C  00100000 00001100  11100010  10000000  10001100  E2 80 8C
ZWJ    U+200D  00100000 00001101  11100010  10000000  10001101  E2 80 8D
あ     U+3042  00110000 01000010  11100011  10000001  10000010  E3 81 82
天     U+5929  01011001 00101001  11100101  10100100  10101001  E5 A4 A9
文     U+6587  01100101 10000111  11100110  10010110  10000111  E6 96 87
台     U+53F0  01010011 11110000  11100101  10001111  10110000  E5 8F B0

For details of the Unicode encoding forms, see Section 3.9, "Unicode Encoding Forms", in Chapter 3 of the Unicode Standard, Version 4.0.
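
The three bit patterns in the table can be written out as a short Python sketch. The function name utf8_encode is made up for this illustration, and the sketch deliberately covers only the one-, two- and three-byte forms shown above (scalar values up to U+FFFF); it is not a complete UTF-8 encoder.

```python
def utf8_encode(code_point):
    """Encode a Unicode scalar value up to U+FFFF using the patterns above."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        # 00000000 0xxxxxxx -> 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 00000yyy yyxxxxxx -> 110yyyyy 10xxxxxx
        return bytes([0xC0 | (code_point >> 6), 0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError("surrogate code points are not scalar values")
        # zzzzyyyy yyxxxxxx -> 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    raise ValueError("four-byte forms are outside the scope of this sketch")


for cp in (0x0020, 0x00D6, 0x042E, 0x5929):
    encoded = utf8_encode(cp)
    print(f"U+{cp:04X}", " ".join(f"{byte:02X}" for byte in encoded))
```

The results can be cross-checked against Python's built-in encoder, e.g. chr(0x5929).encode("utf-8").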

Page size distribution and MEGA HOST problem

Based on the most recent African web crawl data, I estimated the distribution of the number of pages per host. The most common size is around 20,000 to 40,000 pages per host, but a few large hosts contain more than a million pages. In the African web, we found some 50 such MEGA HOSTs. Most of these are under the South African country domain (.za) and seem to be run by a single hosting service company. These hosts provide at least 50 million pages, comparable to the size of the entire African web hosted by the remaining servers.

[Figure: per-host page size distribution]

These MEGA HOSTs are a real headache for us: they degrade the time performance of crawling. As long as we keep the current "polite" and "modest" crawling policy (i.e. a minimum interval of 5 seconds between successive HTTP requests to the same host), a simple calculation tells us that completely downloading a single MEGA HOST will take at least 57 days = (1,000,000 pages * 5 seconds) / (3,600 seconds * 24 hours).
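
The same back-of-the-envelope calculation as a small Python sketch. The function name is made up for this illustration, and it assumes that download time itself is negligible, so only the politeness interval counts.

```python
SECONDS_PER_DAY = 3600 * 24


def crawl_days(page_count, delay_seconds):
    """Lower bound on the days needed to fetch page_count pages from one host."""
    return page_count * delay_seconds / SECONDS_PER_DAY


# One MEGA HOST of a million pages, with a 5-second interval between requests.
print(f"{crawl_days(1_000_000, 5):.1f} days")  # prints "57.9 days"
```

Halving the interval or the page limit halves the estimate, which is why the options below focus on those two quantities and on how hosts are assigned to crawler threads.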

To avoid this, the first option is to set a lower page-limit threshold for a single host. In this case, however, we will lose an unknown but substantial number of pages (currently we set the page-limit threshold at a million). The second option is to adopt "URL-hash" instead of "Site-hash", but this may create another problem: "URL-hash" inevitably requires much more frequent communication between threads (or agents), and the increased overhead will degrade downloading speed in another way! (For details, see [1].)
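
The difference between the two assignment schemes can be sketched in Python as follows. This is only an illustration under stated assumptions, not our crawler's actual code: the function names, the use of SHA-256 and the example URLs are all hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def _bucket(key, n_workers):
    """Map a string key to a worker index in a stable way."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_workers


def site_hash(url, n_workers):
    """Site-hash: every page of a host goes to the same worker."""
    return _bucket(urlsplit(url).hostname or "", n_workers)


def url_hash(url, n_workers):
    """URL-hash: pages of one host are spread over the workers."""
    return _bucket(url, n_workers)


# Hypothetical URLs on a single host.
urls = [f"http://www.example.co.za/page{i}.html" for i in range(5)]
for url in urls:
    print(url, "site-hash:", site_hash(url, 4), "url-hash:", url_hash(url, 4))
```

With site-hash, a single worker owns the whole MEGA HOST, so it can enforce the 5-second interval alone but also carries the full 57 days. With URL-hash the pages are shared out, but the workers must then coordinate to keep the per-host interval, which is the communication overhead mentioned above.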

It's really a headache!

[1] Junghoo Cho & Hector Garcia-Molina, Parallel Crawlers, WWW2002, May 7-11, 2002.

How to identify languages?

Language Trainer and Identificator


User’s Manual




Nagaoka University of Technology

Language Observatory Project


Nov. 5, 2005


documentation prepared by



Quick Start

The LOP Team’s Language Trainer and Identificator can help you find out the language, script and encoding of texts in more than 300 languages. You can even help extend this number by submitting sample texts of good quality. All our efforts support the elimination of the ‘digital language divide’.

If you meet the requirements, you can start using the web-based application right now.

  1. Start your web browser

All major web browsers should work well, including Internet Explorer and Firefox.

  2. Open the address:

You may find that the web application appears in your own language. Since we are committed to internationalization (i18n), we began developing the site using i18n principles. To change the language, modify the language setting of your browser and click the Refresh button at the top.

  3. To try the Identificator module, click on the Identification menu on the left.

There is no explicit rule for the length of the text, but as a rule of thumb, the longer the text, the better the result of the analysis. If you input fewer than 30 characters, the output may be ambiguous.

    1. You can type your text directly into the Text input field.
    2. You can type a valid URL of a single web page. ‘Copy and paste’ is recommended.
    3. You can upload a local file for identification by clicking on the Browse button.
  4. To start the identification of the text you have entered, click on the Identify button.

The Result will display a list of language, script and encoding combinations in descending order of similarity between your input text and the texts already in our database.
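
This manual does not describe how the similarity is computed. Purely as an illustration of ranking candidates by similarity, the following Python sketch compares character trigram counts using cosine similarity. This technique is an assumption chosen for the example, not necessarily the method used by the Identificator, and the function names and training samples are made up.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Count the character n-grams of a text, padded with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def rank(text, training):
    """Return (label, score) pairs in descending order of similarity."""
    query = ngram_profile(text)
    scores = [(label, cosine(query, ngram_profile(sample)))
              for label, sample in training.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Tiny hypothetical training samples; a real database would hold far more text.
training = {
    "English": "the quick brown fox jumps over the lazy dog and the cat",
    "Turkish": "hızlı kahverengi tilki tembel köpeğin üstünden atlar",
}
for label, score in rank("the cat sat on the mat", training):
    print(f"{label}: {score:.3f}")
```

A sketch like this also shows why very short inputs give ambiguous output: with few n-grams, several candidates end up with similar scores.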

Language Observatory Tutorials

This tutorial is intended to serve as convenient online learning material for those who are interested in the activities of the Language Observatory. The following contents will be provided very soon.

§1: How many languages are spoken on the globe?
  Ethnologue: Languages of the World, cataloging 6,912 known living languages.
  List of languages on iLoveLanguages.

§2: Language names and codes
  Language names on Ethnologue: Languages of the World.
  Language codes on Ethnologue: Languages of the World.

§3: Endangered languages
  Endangered language, from Wikipedia.
  UNESCO's Atlas on Endangered Languages Project.
  The International Clearing House for Endangered Languages at the University of Tokyo.

§4: How many writing systems are used?
  List of languages by writing system, from Wikipedia.

§5: Taxonomy of writing systems

§6: Forgotten scripts
  Forgotten Scripts by Dino Manzella.

§7: What is a charset?
  Character encoding, from Wikipedia.
  IANA Character Sets.
  W3C's tutorial: Character sets & encodings in XHTML, HTML and CSS.

§8: What is UCS/Unicode?
  Universal Character Set, from Wikipedia.
  Unicode, from Wikipedia.
  Unicode Home Page.

§9: What is UTF-8?
  UTF-8, from Wikipedia.
  UTF-8 and Unicode Standards.
  UTF-8 and Unicode FAQ for Unix/Linux.

§10: ccTLD, country code, and country
  IANA ccTLD Database.

§11: How does a crawler robot work?
  Web crawler, from Wikipedia.
  The Web Robots Pages.

§12: How are seed URLs collected?

§13: How to identify languages?
  Language Observatory's Language Trainer and Identificator.
  Some Language Identification Tools.

