Language Observatory
2007-07-05
Installing Language Identification Module (LIM) web application in Windows
The LIM web application runs inside the Apache Tomcat Servlet/JSP container. The following are the requirements for LIM on the Windows platform.
- Java 2 Standard Edition Runtime Environment (JRE) version 5.0 or later.
- Apache Tomcat 5.0 or above
- Apache Ant 1.6.5 or above
=============================
Installing LIM for Windows
=============================
Unpack LIM.zip to c:\lim
=============================
Running With JDK 5.0 Or Later
=============================
p/s: Some parts of the following guide are copied from Tomcat's installation guide.
Java is the core engine, so make sure it is installed.
Install the JDK according to the instructions included with the release.
Set an environment variable named JAVA_HOME to the pathname of the directory into which you installed the JDK, e.g. c:\j2sdk5.0 or /usr/local/java/j2sdk5.0.
=============================
Download and Install Apache Ant
=============================
Download the latest Ant binary distribution from http://ant.apache.org/
Unpack it, then set an environment variable named ANT_HOME to the directory where Ant was installed, e.g. C:\apache-ant-1.7.0 or /usr/local/apache-ant-1.7.0
=============================
Download and Install the Tomcat Binary Distribution
=============================
Download a binary distribution of Tomcat from http://tomcat.apache.org
Unpack the binary distribution into a convenient location so that the distribution resides in its own directory (conventionally named "apache-tomcat-[version]").
Set an environment variable named CATALINA_HOME to the pathname of the directory into which you installed Tomcat, e.g. C:\apache-tomcat-6.0.13 or /usr/local/apache-tomcat-6.0.13.
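Before continuing, it can help to confirm that the three environment variables named above are set and point to existing directories. The following Python sketch is not part of LIM; the helper name check_homes is hypothetical, and it assumes a Python interpreter is available on the machine.

```python
import os

# Environment variables that this guide asks you to define.
REQUIRED = ("JAVA_HOME", "ANT_HOME", "CATALINA_HOME")

def check_homes(env=None):
    """Return a dict mapping each required variable to a status string."""
    env = os.environ if env is None else env
    report = {}
    for name in REQUIRED:
        value = env.get(name)
        if not value:
            report[name] = "not set"
        elif not os.path.isdir(value):
            report[name] = "set, but directory not found: " + value
        else:
            report[name] = "ok"
    return report

if __name__ == "__main__":
    for name, status in check_homes().items():
        print(name + ": " + status)
```

Any variable reported as anything other than "ok" should be corrected before the later steps, since they rely on these paths.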
=============================
Install LIM web application to Tomcat
=============================
There are two groups of files to install.
(1) LIM depends on many third-party libraries. Copy the following to the Tomcat library directory, e.g. copy c:\lim\lib\* %CATALINA_HOME%\lib\
List of libraries:
antlr.jar
asm-attrs.jar
asm.jar
cglib-2.1.jar
commons-beanutils.jar
commons-cli-1.0.jar
commons-collections-2.1.1.jar
commons-digester.jar
commons-fileupload.jar
commons-lang-2.1.jar
commons-logging.jar
commons-validator.jar
dom4j-1.6.jar
ehcache-1.1.jar
fastutil-4.4.1.jar
hibernate3.jar
hsqldb.jar
htmlparser.jar
jakarta-oro.jar
jdbc2_0-stdext.jar
jta.jar
log4j-1.2.9.jar
mg4j-0.9.1.jar
struts.jar
utilx-1.2.jar
(2) Copy the LIM web application, i.e. trainer.war, to Tomcat's webapps directory, e.g. copy c:\lim\trainer.war %CATALINA_HOME%\webapps\
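The two copy steps can also be scripted. The sketch below is only an illustration of steps (1) and (2), not a script shipped with LIM: the function name install_lim is hypothetical, and the lib and webapps subdirectories follow the examples given in this guide.

```python
import glob
import os
import shutil

def install_lim(lim_dir, catalina_home):
    """Copy LIM's libraries and trainer.war into a Tomcat installation.

    Returns the list of file names that were copied.
    """
    lib_dest = os.path.join(catalina_home, "lib")
    webapps_dest = os.path.join(catalina_home, "webapps")
    os.makedirs(lib_dest, exist_ok=True)
    os.makedirs(webapps_dest, exist_ok=True)

    copied = []
    # (1) Third-party libraries: every regular file under <lim_dir>/lib.
    for path in sorted(glob.glob(os.path.join(lim_dir, "lib", "*"))):
        if os.path.isfile(path):
            shutil.copy2(path, lib_dest)
            copied.append(os.path.basename(path))
    # (2) The web application archive itself.
    shutil.copy2(os.path.join(lim_dir, "trainer.war"), webapps_dest)
    copied.append("trainer.war")
    return copied
```

With the paths used in this guide, the call would be install_lim(r"c:\lim", os.environ["CATALINA_HOME"]).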
=============================
Start LIM database
=============================
Make sure the JRE and Ant binaries are in the current search path. If not, use the following command to set it up:
set PATH=%PATH%;%JAVA_HOME%\bin;%ANT_HOME%\bin
Start the LIM database by typing the following at the command prompt:
cd c:\lim
db dbstart
=============================
Startup Tomcat
=============================
p/s: If you encounter problems starting the LIM web application in Tomcat, try raising its minimum memory pool (at least 128 MB) and maximum memory pool (at least 256 MB).
The following command increases the memory:
set CATALINA_OPTS=-Xms128m -Xmx256m
Tomcat can then be started by executing one of the following commands:
%CATALINA_HOME%\bin\startup.bat (for Windows)
Or $CATALINA_HOME/bin/startup.sh (for Unix/Linux)
Now you can access LIM by pointing your browser to http://localhost:8080/trainer
16:32:27 -
ycchew -
2006-04-23
How to define "Endangered Language"
Conventional Definition
"Atlas of the World's Languages in Danger of Disappearing"[1] defines five levels for endangerment of language.
| symbol | level | definition |
| △ | Potentially endangered language | decreasing numbers of children learn the language |
| ○ | Endangered language | the youngest speakers are young adults |
| ● | Seriously endangered language | the youngest speakers have reached or passed middle age |
| ⊕ | Moribund language | only a few elderly speakers are left |
| + | Extinct language | no speakers are left |
As shown above, the "aging of speakers" is essentially the sole criterion of endangerment. On another page of the publication, we find the following definition: "What exactly does it mean when a language is referred to as being 'endangered'? Basically, the language of any community that is no longer learned by children, or at least by a larger part of the children of that community (say, at least 30 per cent), should be regarded as 'endangered' or at least 'potentially endangered'."[2]
So when the number of children who learn a language declines below 30 per cent of their generation, the language enters the list of endangered languages; then, as its speakers age, it climbs the ladder of endangerment.
Wikipedia gives another set of criteria for endangered languages.[4] It lists the following three criteria, which are quite similar to those mentioned above.
1. The number of speakers currently living.
2. The mean age of native and/or fluent speakers.
3. The percentage of the youngest generation acquiring fluency with the language in question.
Next Question
Here, I would like to ask a question: how, then, should we define the level of endangerment of a language on the Internet? A language may disappear from cyberspace even while many speakers keep using it as a means of communication in daily life. The aging of speakers is of course an important factor, but the disappearance of a language from the Internet happens long before the last speaker dies or ceases to use it. In reality, we have to admit that many of the world's languages have not even been born yet on the Internet!
The criteria for the endangerment of a language on the Internet involve more than the existence of speakers and their aging. They should cover a much wider range of factors and phenomena, such as the availability of written documents on the Internet, official use of the language by e-government services, use of the language as a medium of education and knowledge creation on the Internet, and the various technical tools that enable users to take advantage of the pool of electronic knowledge written in the language.
My Proposal
Considering all these factors, I am proposing several criteria. The following tentative list shows a few of them.
- Number of web pages written in a language
- Number of web pages divided by the number of speakers of a language
- Availability of a language at the government site
- Availability of a language at the university and/or other educational institutions' site
- Availability of online newspapers/magazines written in a language
- Availability of search engine(s) for pages written in a language
- Availability of online radio in a language
- Availability of chat rooms in a language
- Availability of globally standardized charset for the script used to write a language
- Frequency of updating of web pages written in a language
In principle, all these data can be drawn from the Language Observatory's crawled and analyzed database. I am waiting for your comments.
REFERENCE
[1] "Atlas of the World's Languages in Danger of Disappearing: Second edition, revised, enlarged and updated", edited by Stephen A. Wurm, cartography by Ian Heyward, UNESCO Publishing, 2001, ISBN:92-3-103255-0.
[2] ibid. p.14
[3] Interactive Atlas of the World's Languages in Danger of Disappearing online
[4] Endangered language at Wikipedia
18:51:54 -
Mikami -
2006-03-04
Turkish Language Tutorial
This article was written by Dr. Ahmed Tarcan, a linguist at Dicle Üniversitesi, Diyarbakir, Turkey. He visited our laboratory in February-March 2006.
Turkish Language Family
Turkish belongs to the Altay branch of the Ural-Altaic linguistic family, the same family as Finnish and Hungarian. It is the westernmost of the Turkic languages spoken across Central Asia and is generally classified as a member of the South-West group, also known as the Oguz group. Other Turkic languages, all of which are closely related, include Azerbaijani (Azeri), Kazakh, Kyrgyz, Tatar, Turkmen, Uighur, Uzbek, and many others spoken from the Balkans across Central Asia into northwestern China and southern Siberia. Turkic languages are often grouped with Mongolian and Tungusic languages in the Altaic language family. Strictly speaking, the "Turkish" languages spoken between Mongolia and Turkey should be called Turkic languages, and the term "Turkish" should refer to the language spoken in Turkey alone. It is common practice, however, to refer to all these languages as Turkish, and to differentiate them by geographical area, for example, the Turkish language of Azerbaijan.
Speakers

Through the span of history, Turks have spread over a wide geographical area, taking their language with them. Turkish speaking people have lived in a wide area stretching from today's Mongolia to the north coast of the Black Sea, the Balkans, East Europe, Anatolia, Iraq and a wide area of northern Africa. Due to the distances involved, various dialects and accents have emerged. Turkish is also the language spoken at home by people who live in the areas that were governed by the Ottoman Empire. For instance, in Bulgaria there are over a million speakers. About 50,000 Turkish speakers live in Uzbekistan, Kazakhstan, Kyrgyzstan, Tajikistan, and Azerbaijan. In Cyprus, Turkish is a co-official language (with Greek) where it is spoken as a first language by 19 percent of the population, especially in the North (KKTC). Over 1.5 million speakers are found in Bulgaria, Macedonia, and Greece; over 2.5 million speakers live in Germany (and other northern European countries) where Turks have for many years been "guest workers." About 40,000 Turkish speakers live in the United States.
Dialects
Turkish has several dialects. The Turkish dialects can be divided into two major groups: Western dialects and Eastern dialects. Of the major Turkish dialects, Danubian appears to be the only member of the Western group. The following dialects make up the Eastern group: Eskisehir, Razgrad, Dinler, Rumelian, Karamanli, Edirne, Gaziantep, and Urfa. There are some other classifications that distinguish the following dialect groups: South-western, Central Anatolia, Eastern, Rumelian, and Kastamonu dialects. Modern standard Turkish is based on the Istanbul dialect of Anatolian.
History
The history of the language is divided into three main periods: Old Turkish (from the 7th to the 13th centuries), Middle Turkish (from the 13th to the 20th) and New Turkish from the 20th century onwards. During the Ottoman Empire period, Arabic and Persian words invaded the Turkish language, and it consequently became a mixture of three different languages. During the Ottoman period, which spanned five centuries, the natural development of Turkish was severely hampered. Turkish formed the basis for Ottoman Turkish, the written language of the Ottoman Empire. Ottoman Turkish was basically Turkish in structure, but with a heavy overlay of Arabic and Persian vocabulary and an occasional grammatical influence. Ottoman Turkish co-existed with spoken Turkish, with the latter being considered a "gutter language" and not worthy of study. Ottoman Turkish and the spoken language were both represented with an Arabic script.
Then there was the "new language" movement started by Kemal Atatürk. In 1928, five years after the proclamation of the Republic, the Arabic alphabet was replaced by the Latin one, which in turn sped up the movement to rid the language of foreign words. Prior to the reform that introduced the Roman script, Turkish was written in the Arabic script. Up to the fifteenth century, the Anatolian Turks used the Uighur script to write Turkish. The Turkish Language Institute (Turk Dil Kurumu) was established in 1932 to carry out linguistic research and contribute to the natural development of the language. As a consequence of these efforts, modern Turkish is a literary and cultural language developing naturally and free of foreign influences. Today literacy rates in Turkey are over 90%.
Grammar
Like all of the Turkic languages, Turkish is agglutinative; that is, grammatical functions are indicated by adding various suffixes to stems. Separate suffixes on nouns indicate case and number, and there is no grammatical gender. Nouns are declined in three declensions with six case endings: nominative, genitive, dative, accusative, locative, and ablative; number is marked by a plural suffix. Verbs agree with their subjects in person and number, and, as in nouns, separate identifiable suffixes perform these functions. The order of elements in a verb form is: verb stem + tense/aspect marker + subject affix. There is no definite article; the number "one" may be used as an indefinite article.
Subject-Object-Verb word order in Turkish is a typical Turkic characteristic, but other orders are possible in certain discourse situations. As an SOV language, in which objects precede the verb, Turkish has postpositions rather than prepositions, and relative clauses that precede the noun they modify.
Phonetics
Turkish has 8 vowels, and 21 consonants. It also has Turkic vowel harmony in which the vowels of suffixes must harmonize with the vowels of noun and verb stems; thus, for example, if the stem has a round vowel then the vowel of the suffix must be round, and so on. Stress on words pronounced in isolation is on the final syllable, but in discourse, stress assignment is complicated especially in the verb.
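As a small illustration of vowel harmony, the sketch below chooses between the two forms of the plural suffix, -lar and -ler, according to whether the last vowel of the stem is a back vowel (a, ı, o, u) or a front vowel (e, i, ö, ü). It covers only this front/back dimension, assumes lowercase input, and ignores loanword exceptions; the function name plural is hypothetical.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")

def plural(noun):
    """Attach -lar or -ler according to the last vowel of a lowercase stem."""
    for ch in reversed(noun):
        if ch in BACK_VOWELS:
            return noun + "lar"
        if ch in FRONT_VOWELS:
            return noun + "ler"
    raise ValueError("no vowel found in " + repr(noun))

if __name__ == "__main__":
    for word in ("ev", "kitap", "göz", "okul"):
        print(word, "->", plural(word))
```

For example, ev (house) becomes evler, while kitap (book) becomes kitaplar.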
Alphabet
The Turkish of Turkey does not use Q, X and W, but the Turkish of the Tatars uses these characters as well.
a b c ç d e f g ğ h ı i j k l m n o ö p r s ş t u ü v y z
ANNEX: Scripts, Region and Speaking Population of Turkish Languages
| Language name | SIL code | Roman | Arabic | Cyrillic | region | population |
| Azerbaijani, South | azb | * | * | * | Iran | 24,364,000 |
| Azerbaijani, North | azj | * | * | | Azerbaijan | 7,059,529 |
| Chuvash | chv | | | * | Russia | 1,834,394 |
| Kazakh | kaz | * | * | * | Kazakhstan | 8,178,879 |
| Kirghiz | kir | * | * | * | Kyrgyzstan | 3,136,733 |
| Tatar | tat | * | | | Russia | 1,610,032 |
| Turkish | tur | * | | | Turkey | 50,625,794 |
| Turkmen | tuk | * | * | * | Turkmenistan | 6,403,533 |
| Uighur | uig | * | * | * | China | 7,601,431 |
| Uzbek, North | uzn | * | * | * | Uzbekistan | 18,795,591 |
| Uzbek, South | uzs | | * | | Afghanistan | 1,454,981 |
note: Ethnologue lists 40 individual languages under the Turkish language classification. Only some of them are listed above.
source: Ethnologue, 14th Edition
Related Sites
1. Ethnologue > Turkish
2. Wikipedia > Turkish language
3. Wikipedia > Turkic language
15:14:02 -
Mikami -
2006-01-01
UTF-8 conversion
| char. | U+nnnn | Scalar Value | UTF-8 1st Byte | UTF-8 2nd Byte | UTF-8 3rd Byte | UTF-8 in HEX |
| | U+0000~U+007F | 00000000 0xxxxxxx | 0xxxxxxx | | | |
| SPACE | U+0020 | 00000000 00100000 | 00100000 | | | 20 |
| A | U+0041 | 00000000 01000001 | 01000001 | | | 41 |
| DEL | U+007F | 00000000 01111111 | 01111111 | | | 7F |
| | U+0080~U+07FF | 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
| NBSP | U+00A0 | 00000000 10100000 | 11000010 | 10100000 | | C2 A0 |
| Ö | U+00D6 | 00000000 11010110 | 11000011 | 10010110 | | C3 96 |
| Ю | U+042E | 00000100 00101110 | 11010000 | 10101110 | | D0 AE |
| Ա | U+0531 | 00000101 00110001 | 11010100 | 10110001 | | D4 B1 |
| ث | U+062B | 00000110 00101011 | 11011000 | 10101011 | | D8 AB |
| | U+0800~U+FFFF | zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
| अ | U+0905 | 00001001 00000101 | 11100000 | 10100100 | 10000101 | E0 A4 85 |
| অ | U+0985 | 00001001 10000101 | 11100000 | 10100110 | 10000101 | E0 A6 85 |
| ਅ | U+0A05 | 00001010 00000101 | 11100000 | 10101000 | 10000101 | E0 A8 85 |
| அ | U+0B85 | 00001011 10000101 | 11100000 | 10101110 | 10000101 | E0 AE 85 |
| ಅ | U+0C85 | 00001100 10000101 | 11100000 | 10110010 | 10000101 | E0 B2 85 |
| ZWNJ | U+200C | 00100000 00001100 | 11100010 | 10000000 | 10001100 | E2 80 8C |
| ZWJ | U+200D | 00100000 00001101 | 11100010 | 10000000 | 10001101 | E2 80 8D |
| あ | U+3042 | 00110000 01000010 | 11100011 | 10000001 | 10000010 | E3 81 82 |
| 天 | U+5929 | 01011001 00101001 | 11100101 | 10100100 | 10101001 | E5 A4 A9 |
| 文 | U+6587 | 01100101 10000111 | 11100110 | 10010110 | 10000111 | E6 96 87 |
| 台 | U+53F0 | 01010011 11110000 | 11100101 | 10001111 | 10110000 | E5 8F B0 |
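The bit patterns in the table can be reproduced with a few shifts and masks. The following Python sketch covers only the range shown here (U+0000 to U+FFFF, one to three bytes); the function names utf8_encode and hex_bytes are hypothetical, and in practice Python's built-in str.encode("utf-8") does the same job.

```python
def utf8_encode(code_point):
    """Encode one code point in U+0000..U+FFFF as UTF-8 bytes."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only U+0000..U+FFFF is covered here")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    # 1110zzzz 10yyyyyy 10xxxxxx
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

def hex_bytes(data):
    """Format bytes the way the 'in HEX' column does."""
    return " ".join("%02X" % b for b in data)

if __name__ == "__main__":
    for cp in (0x0020, 0x00D6, 0x042E, 0x0905, 0x3042, 0x5929):
        print("U+%04X" % cp, hex_bytes(utf8_encode(cp)))
```

Each printed line can be checked against the corresponding row of the table.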
For details of the Unicode encoding forms, see "3.9 Unicode Encoding Forms" in Chapter 3 of the Unicode Standard, version 4.0.
23:40:50 -
Mikami -
2005-12-28
Page size distribution and MEGA HOST problem
Based on the most recent African web crawl data, I estimated the per-host page size distribution. The most common page size per host is around 20,000 to 40,000. But a few large hosts contain more than a million pages. In the African web, we found some 50 such MEGA HOSTs. Most of these are under the South African country domain (.za) and seem to be run by a single hosting service company (http://id.co.za/domain/). These hosts provide at least 50 million pages, comparable to the size of the entire African web hosted by the remaining servers.
These MEGA HOSTs are really a headache for us. They degrade the time performance of crawling. As long as we keep the current "polite" and "modest" crawling policy (i.e. a minimum interval of 5 seconds between successive HTTP requests to the same host), a simple calculation tells us that completely downloading a single MEGA HOST will take at least 57 days = (1,000,000 pages * 5 seconds) / (3,600 sec. * 24 hours).
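The same arithmetic as a small Python sketch, so that other host sizes or intervals can be tried. The function name crawl_days is hypothetical, and the estimate counts only the enforced waiting time between requests, not the download time itself.

```python
SECONDS_PER_DAY = 3600 * 24

def crawl_days(pages, interval_seconds=5):
    """Lower bound, in days, for fetching `pages` pages from a single host."""
    return pages * interval_seconds / SECONDS_PER_DAY

if __name__ == "__main__":
    print(round(crawl_days(1_000_000), 1))  # about 57.9 days
```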
To avoid this, the first option is to set a lower page-limit threshold for a single host (currently the threshold is one million pages). In this case, however, we will lose an unknown but substantial number of pages. The second option is to adopt "URL-hash" instead of "Site-hash." But this option may create another problem: "URL-hash" inevitably requires much more frequent communication between threads (or agents), and the increased overhead will degrade downloading speed in another way! (For details, see [1].)
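To make the difference between the two partitioning schemes concrete, here is a minimal Python sketch. It is not our crawler's code: the function names, the choice of SHA-256 as the hash, and the example host name are all assumptions made for illustration. Site-hash assigns every URL of a host to the same agent, while URL-hash spreads the pages of one host over all agents.

```python
import hashlib
from urllib.parse import urlsplit

def _bucket(key, n_agents):
    """Map a string to an agent index in 0..n_agents-1 via a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_agents

def site_hash(url, n_agents):
    """Site-hash: the agent depends only on the host name."""
    return _bucket(urlsplit(url).hostname or "", n_agents)

def url_hash(url, n_agents):
    """URL-hash: the agent depends on the complete URL."""
    return _bucket(url, n_agents)

if __name__ == "__main__":
    # Hypothetical host, used only for illustration.
    urls = ["http://megahost.example.co.za/page%d.html" % i for i in range(8)]
    print("site-hash:", [site_hash(u, 4) for u in urls])
    print("url-hash: ", [url_hash(u, 4) for u in urls])
```

With site-hash a MEGA HOST stays with one agent, so the 5-second interval is easy to enforce but that agent is tied up for weeks. With URL-hash the host's pages are shared among agents, which then have to coordinate to keep the per-host interval and to exchange the links they discover.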
It's really a headache!
[1] Junghoo Cho & Hector Garcia-Molina, Parallel Crawlers, WWW2002, May 7-11, 2002.
03:40:25 -
Mikami -
2005-12-08
How to identify languages?
Language Trainer and Identificator
User’s Manual
LOP
Nagaoka University of Technology
Language Observatory Project
Nov. 5, 2005
v0.1
Documentation prepared by GÖNDRI NAGY János
Quick Start
The LOP Team’s Language Trainer and Identificator can help you find out the language, script and encoding of texts in more than 300 languages. And you can even contribute to extending this number by submitting sample texts of good quality. All our efforts are in line with supporting the abolishment of the ‘digital language divide’.
If you meet the requirements, you can start using the web-based application right now.
- Start your web browser
All the major web browsers should function well; Internet Explorer or Firefox will do.
- Open the address: http://gii2.nagaokaut.ac.jp/trainer/
You may find that the web application appears in your own language. Since we are committed to internationalization (i18n), we have been developing the site using i18n principles. To change the language, modify the language setting of your browser and click on the Refresh button at the top.
- To try the Identificator module, click on the Identification menu on the left.
There is no explicit rule for the length of the text, but as a rule of thumb, the longer the text, the better the result of the analysis. With fewer than 30 characters of input, the output may be ambiguous.
- You can type your text directly into the Text input field.
- You can type a valid URL of a single web page. Copy and paste is recommended.
- You can upload a local file for identification by clicking on the Browse button.
- To start the identification of the text you have entered, click on the Identify button.
The Result will display a list of language, script and encoding combinations in descending order of similarity between your input text and the texts already residing in our database.
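To give a feel for what "ranking by similarity" can mean, here is a minimal Python sketch. It is not the Identificator's actual algorithm, which this manual does not describe: the use of character trigrams, cosine similarity, the function names and the two tiny training samples are all assumptions chosen for illustration, and the sketch works on already decoded text, so it ranks languages only.

```python
from collections import Counter
from math import sqrt

def trigram_profile(text):
    """Count the character trigrams of a text (whitespace normalized)."""
    text = " " + " ".join(text.lower().split()) + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(p, q):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = (sqrt(sum(c * c for c in p.values()))
            * sqrt(sum(c * c for c in q.values())))
    return dot / norm if norm else 0.0

def rank(text, training):
    """Return (similarity, label) pairs, most similar first.

    `training` maps a label to a sample text for that label.
    """
    query = trigram_profile(text)
    scores = [(cosine(query, trigram_profile(sample)), label)
              for label, sample in training.items()]
    return sorted(scores, reverse=True)

if __name__ == "__main__":
    # Tiny hypothetical training samples; real ones would be much longer.
    training = {
        "English": "the quick brown fox jumps over the lazy dog and the cat",
        "Turkish": "bu bir deneme metnidir ve türkçe yazılmıştır",
    }
    for score, label in rank("the dog and the fox", training):
        print("%.3f  %s" % (score, label))
```

The sketch also shows why short inputs are ambiguous: a few characters produce only a handful of trigrams, so the similarity scores of different languages end up close together.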
18:41:54 -
Janos -
2005-11-28
Language Observatory Tutorials
This tutorial serves as convenient online learning material for those who are interested in the activities of the Language Observatory. The following contents will be provided very soon.
§1:::How many languages are spoken on the globe?
Ethnologue: Languages of the World, cataloging 6,912 known living languages.
List of languages on iLoveLanguages.
§2:::Language names and codes
Language names on Ethnologue: Languages of the World.
Language codes on Ethnologue: Languages of the World.
§3:::Endangered languages
Endangered language, from Wikipedia.
UNESCO's Atlas on Endangered Languages Project.
The International Clearing House for Endangered Languages in University of Tokyo.
§4:::How many writing systems are used?
List of languages by writing system, from Wikipedia.
§5:::Taxonomy of writing systems
§6:::Forgotten scripts
Forgotten Scripts By Dino Manzella.
§7:::What is charset?
Character encoding, from Wikipedia.
IANA Character Sets.
W3C's tutorial: Character sets & encodings in XHTML, HTML and CSS.
§8:::What is UCS/Unicode?
Universal Character Set, from Wikipedia.
Unicode, from Wikipedia.
Unicode Home Page.
§9:::What is UTF-8?
UTF-8, from Wikipedia.
UTF-8 and Unicode Standards.
UTF-8 and Unicode FAQ for Unix/Linux.
§10:::ccTLD, country code, and country
 
IANA ccTLD Database.
§11:::How does a crawler robot work?
Web crawler, from Wikipedia.
The Web Robots Pages.
§12:::How are seed URLs collected?
§13:::How to identify languages?
Language Observatory's Language Trainer and Identificator.
Some Language Identification Tools.
23:55:03 -
Mikami -