Language Observatory


What is the "Language Observatory"?

Just as an astronomical observatory surveys the stars in the universe, the Language Observatory surveys language activity in the virtual universe of the Internet. An astronomical observatory catches the faintest light from the stars; likewise, the Language Observatory tries to catch the subtle messages of less widely spoken languages, as far as they appear on the Internet, and to answer questions such as:

- How many languages are found on the Internet?
- How many web pages are written in a given language/script under a specific country code top-level domain (ccTLD)?
- What kinds of character encoding schemes (CESs) are employed to encode a given language?
- How quickly is UCS/Unicode spreading?
- To what extent are open-source software (OSS) technologies employed by a specific language community?
- How is a specific language community linked with other language communities? (web-graph analysis)

"In the galaxy of languages, every word is a star."... UNESCO

Read more "about us":
1. Objectives of the Language Observatory
2. Who runs the Language Observatory?
3. How does the Language Observatory actually work?
4. What kind of reports will the Language Observatory produce?
5. Language Observatory Project (LOP) Milestones
6. Language Observatory Staff
7. International Advisors
8. Logo Story

20:24:06 - Mikami

Objectives of the Language Observatory

Although over 6,000 languages are currently spoken around the globe (see footnote [1]), only a few of them have been properly represented in the virtual universe of the Internet. We call this situation the "Digital Divide among Languages" or just the "Digital Language Divide" (see another blog article). We share the concern of UNESCO on this point, which stresses the importance of "the preservation of a balanced use of languages in cyberspace". The objectives of the Language Observatory Project can be stated as follows:

- To raise public awareness of "Digital Language Divide" issues

- To encourage support for the processing of those languages now falling through the net.

[1] Although experts estimate that there are more than 6,000 languages on the globe, the project will try to cover the 300+ languages that appear in the alphabetical listing of all translations of the Universal Declaration of Human Rights (UDHR). This coverage is still far beyond the scope of currently available commercial search engines.

20:23:51 - Mikami

Who runs the Language Observatory?

The project is currently funded by the Japan Science and Technology Agency (JST) under the RISTEX program, and is implemented by a partnership of several institutions:

- Nagaoka University of Technology (NUT), Japan
- Keio University, Japan
- Tokyo University of Foreign Studies (TUFS), Japan
- Fakulti Sains Komputer & Sistem Maklumat (FSKSM), Universiti Teknologi Malaysia (UTM), Malaysia
- Thai Computational Linguistic Laboratory (TCL), Thailand
- Miskolc University, Hungary
- Technology Development of Indian Languages (TDIL), Ministry of IT, India
- The Laboratory for Web Algorithmics (LAW), Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano (USM), Milano, Italy

20:23:34 - Mikami

How does the Language Observatory actually work?

The Language Observatory works through the following steps.

- Crawler robots visit pages on the Internet at least once a year and fetch their text content. The robots return to the same pages regularly so that periodic reports can be produced.

- The Language Identification Module (LIM) analyses the page content and identifies the language properties of the page (language, script, character encoding scheme, etc.). The LIM is trained with contributions from language experts.

- The Observatory counts the number of pages according to their language properties and compiles a regular report.

- The Observatory also analyses HTML tag information and link information to reveal open-source usage status, web graph structure, etc.
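The identification and counting steps above can be illustrated with a toy sketch. This is not the Observatory's LIM: it uses a generic byte-trigram overlap score as a stand-in, and the training samples, labels, and function names are all hypothetical.

```python
from collections import Counter

def trigram_profile(data: bytes) -> Counter:
    """Count the overlapping byte trigrams in a byte string."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))

def identify(page: bytes, profiles: dict) -> tuple:
    """Return the label whose training profile shares the most trigrams with the page."""
    page_profile = trigram_profile(page)

    def overlap(label):
        reference = profiles[label]
        return sum(min(count, reference[gram]) for gram, count in page_profile.items())

    return max(profiles, key=overlap)

# Hypothetical training samples keyed by (language, script, CES).
# A real module would be trained on text supplied by language experts.
training = {
    ("English", "Latin", "ASCII"):
        b"the quick brown fox jumps over the lazy dog and the cat",
    ("German", "Latin", "ISO-8859-1"):
        "der schnelle braune Fuchs springt über den faulen Hund".encode("latin-1"),
}
profiles = {label: trigram_profile(sample) for label, sample in training.items()}

# Hypothetical fetched pages, as raw bytes.
pages = [
    b"the dog and the fox",
    "der Hund und der Fuchs".encode("latin-1"),
    b"over the lazy cat",
]

# Count pages by their identified language properties.
census = Counter(identify(page, profiles) for page in pages)
for label, n_pages in census.most_common():
    print(label, n_pages)
```

Working on raw bytes rather than decoded text means the same comparison can separate character encoding schemes as well as languages, which is why the labels here carry all three properties.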

20:23:18 - Mikami

What kind of reports will be produced?

The Observatory regularly publishes the following reports.

1. The Cyber Census Report:

This report reveals the activity level of all the languages observable on the web pages of the Internet. It contains statistics on:
--- number of pages and bytes by language
--- number of pages and bytes by script
--- number of pages and bytes by character encoding scheme (CES)
--- relative share in cyberspace by language
We believe the report will reveal the actual usage of languages in the virtual universe, and will help policy makers and international organizations understand the real picture of the unbalanced usage of languages in cyberspace.
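As a rough illustration of the statistics listed for the Cyber Census Report, the sketch below tallies pages and bytes by language, script, or CES and derives each language's share of pages. The record layout, the sample values, and the function names are assumptions made for this example, not the project's actual data format.

```python
from collections import defaultdict

# Field positions in a hypothetical page record:
# (language, script, CES, size in bytes)
LANGUAGE, SCRIPT, CES, SIZE = 0, 1, 2, 3

# Made-up sample records, for illustration only.
records = [
    ("Thai", "Thai", "TIS-620", 4000),
    ("Thai", "Thai", "UTF-8", 6000),
    ("Hindi", "Devanagari", "UTF-8", 3000),
    ("Malay", "Latin", "ASCII", 1000),
]

def census_by(records, field):
    """Return {value: (pages, bytes)} grouped by the chosen field."""
    totals = defaultdict(lambda: [0, 0])
    for record in records:
        totals[record[field]][0] += 1             # one more page
        totals[record[field]][1] += record[SIZE]  # add its size in bytes
    return {key: tuple(value) for key, value in totals.items()}

def page_share(totals):
    """Return each group's fraction of all counted pages."""
    all_pages = sum(pages for pages, _ in totals.values())
    return {key: pages / all_pages for key, (pages, _) in totals.items()}

by_language = census_by(records, LANGUAGE)
by_ces = census_by(records, CES)
share = page_share(by_language)

for language, (pages, size) in by_language.items():
    print(f"{language}: {pages} pages, {size} bytes, {share[language]:.0%} of pages")
```

The same `census_by` call with `SCRIPT` or `CES` gives the other two breakdowns, so one grouping routine covers all three tables.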

2. The CES Report:

The report describes what kind of character encoding scheme is employed to encode each language and script. Although most pages on the Internet are encoded in widely known CESs, such as ASCII or the ISO/IEC 8859 series for Latin script, ASMO 449 for Arabic script, and JIS or ISO-2022-JP for Japanese script, some pages are encoded with locally developed "fonts" to represent a local language. Such a locally developed font is not just a font but a kind of implicit character encoding scheme, so we call it an "Implicit CES". Even after the emergence of the multi-octet character set ISO/IEC 10646, which covers almost all the scripts currently used in the world, many Implicit CESs are still in use, either because OpenType technologies are unavailable or simply for convenience. The result is chaos in CES usage.
We believe the report will help ICT engineers and policy makers understand the technical problems behind the "Digital Language Divide". Technical specifications of the observed Implicit CESs will be provided through the report and will be used to develop converters.
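To make the idea of a converter concrete, here is a minimal sketch that maps bytes written for a font-based Implicit CES to Unicode text. The mapping table is entirely hypothetical (it does not describe any real font), and real converters usually need more than a byte-for-byte lookup, for example reordering of vowel signs in Indic scripts.

```python
# Hypothetical Implicit CES: a font that draws Devanagari glyphs
# at the byte values normally used for Latin letters.
HYPOTHETICAL_FONT_MAP = {
    0x20: " ",       # space is left unchanged
    0x6B: "\u0915",  # 'k' slot drawn as DEVANAGARI LETTER KA
    0x67: "\u0917",  # 'g' slot drawn as DEVANAGARI LETTER GA
    0x6D: "\u092E",  # 'm' slot drawn as DEVANAGARI LETTER MA
}

def convert(data: bytes, table: dict, unknown: str = "\uFFFD") -> str:
    """Translate font-encoded bytes to Unicode, marking unmapped bytes."""
    return "".join(table.get(byte, unknown) for byte in data)

print(convert(b"kg m", HYPOTHETICAL_FONT_MAP))
```

Bytes missing from the table come out as U+FFFD (the Unicode replacement character), so gaps in a font specification stay visible instead of being silently dropped.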

3. The Corpus Statistics Report:

The report gives various statistics on text data. The statistics include:
--- Single byte/character distribution for text in a given CES
--- Two-byte/character distribution for text in a given CES
--- Three-byte/character distribution for text in a given CES
--- Single word distribution for text in a given language/script
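The distributions listed above are all n-gram frequency tables over different units (bytes, characters, or words). A minimal sketch, assuming a hypothetical helper `ngram_distribution` and made-up sample text:

```python
from collections import Counter

def ngram_distribution(sequence, n):
    """Return the relative frequency of each n-gram in a sequence.

    The sequence may be bytes (byte n-grams), a str (character n-grams),
    or a list of words (word n-grams). Each n-gram is returned as a tuple.
    """
    counts = Counter(
        tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)
    )
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

text = "abab"

# Two-character distribution of the decoded text.
char_bigrams = ngram_distribution(text, 2)

# Single-byte distribution of the same text in a given CES (here UTF-8).
byte_unigrams = ngram_distribution(text.encode("utf-8"), 1)

# Single-word distribution, using whitespace splitting as a simple tokenizer.
word_unigrams = ngram_distribution("to be or not to be".split(), 1)

print(char_bigrams)
print(byte_unigrams)
print(word_unigrams)
```

Whitespace splitting is only a placeholder: scripts that do not separate words with spaces, such as Thai, would need a proper word segmenter before the word distribution is meaningful.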

20:13:43 - Mikami


Language Observatory Staff

Project Management
Director: Yoshiki Mikami
Co-Director (system): Pavol Zavarsky
Co-Director (linguistic): Makoto Minegishi (TUFS)
Co-Director (linguistic): Kazuhiko Machida (TUFS)
Coordinator: Tomoe Takahashi
Coordinator: Jun Sugawara (TUFS/JST)
International Liaison: V. Narayanan (Excel Solutions, Singapore); S. T. Nandasara (Univ. of Colombo, Sri Lanka)
LI Manager: János Göndri Nagy (NUT/Miskolc Univ., Hungary)
Crawl Manager: Chew Yew Choong
SeedURL Manager: Tomoe Takahashi
System Administrator: Chew Yew Choong; Junichiro Chisuga
Graphic Design: Carla Salem (Nagaoka Institute of Design)

Server/Network System
System Design: Yoshihide Chubachi (NUT/JST, Keio Univ.)
Network Design: Katsuko T. Nakahira; Masayuki Takahashi; Tomohide Maki; Chew Yew Choong; Keisuke Koda

Language Identification
System Design: Yoshihide Chubachi (NUT/JST, Keio Univ.)
Algorithm Design: Izumi Suzuki
Researcher: János Göndri Nagy (NUT/Miskolc Univ., Hungary); Wunna Ko Ko; Shota Wada; Rizza Caminero; Keisuke Koda

Data Analysis
Contents Analysis: Mohd Zaidi abd Rozan (NUT/UTM); Masayuki Takahashi
Network Analysis: Katsuko T. Nakahira; Tetsuya Hoshino
Language Analysis: János Göndri Nagy (NUT/Miskolc Univ., Hungary); Robin Nagano Lee (Miskolc Univ., Hungary)
Charset Analysis: János Göndri Nagy (NUT/Miskolc Univ., Hungary); Ryosuke Nakao
Web Graph Analysis: Shota Wada; Rizza Caminero; Naoyuki Ishihara

Language Gallery
Researcher: Teruaki Nagasawa; Keiko Kitade; Ryosuke Nakao; Taichi Sotoya; Jun Yokoyama

NOTE: Unless otherwise noted in parentheses, staff members are faculty and students of Nagaoka University of Technology (NUT).

23:36:00 - Mikami


Logo Story

Language Observatory logotype
A calligraphic designer from Lebanon, Ms. Carla, happened to come to the GII laboratory in May 2005. Matsuda-sensei introduced her to me because she saw some links between Ms. Carla's interests and the research topics of our laboratory, especially the Language Observatory project. I must thank Matsuda-sensei for that.

Lebanon is the Phoenicia of old and the birthplace of the original alphabet. Byblos lies in the north of the country, and I have long wanted to visit it someday. From our first meeting, I found that Matsuda-sensei's insight was quite right, and we discovered many common interests. I also asked her the favor of designing a logotype for the project, and in the end she made one for us. The following is the message from the designer.

The wind is constantly changing its form and direction. Languages are quite similar. Some evolve while others face extinction, constantly being influenced, recycled, or personalized. The windmill is one indicator of the wind's direction and nature. It monitors and gives evidence. If you are watching, then you are the observer. As long as languages exist and are in constant change, they should be monitored from a focal point, at a fixed time and place. This is the Language Observatory's main concern. The logotype was born from this concept.

From the shores of Byblos and the first letter of the alphabet to the last letter of the Greek alphabet, the logo attempts to speak for itself. Is it the letter 'o' or is it 'ω'? Maybe it is a spiral oscillation that leads you from one to the other. There is not one answer. YOU hold one; that is, your own language and interpretation.

00:13:00 - Mikami


Major Events & Milestones


2005/11/19: African Web Survey Project discussed and agreed between the Academy of African Languages (ACALAN), Linguasphere Observatory and Language Observatory
2005/11/16-18: Language Observatory attended the WSIS Tunis Phase
2005/11/17: MOU signed with Linguasphere Observatory, UK
2005/10/20: First version of the Language Identifier Module released
2005/05/12: Language Observatory and Milano University team presented a joint poster at WWW2005, Chiba. MOU signed with Milano University
2005/05/10: WSIS UNESCO Thematic Meeting on Multilingualism at Bamako, Mali. Prof. Zaki of UTM attended on behalf of the Language Observatory
2005/04/28: Prof. K. Machida delivered a lecture on grammatological informatics
2005/04/19: Dr. M. Iwahashi's lab members visited LOP
2005/04/15: Universiti Teknologi Malaysia (UTM) assigned as the first Regional Language Observatory, in charge of Malaysia and OIC countries
2005/04/01: Dr. Chubachi joined the NUT/LOP team
2005/03/28: Mr. Göndri Nagy János from Miskolc University joined the NUT/LOP team
2005/02/24-25: LOWS Technical Meeting at Oginoshima in deep snow
2005/02/21-23: The Second Language Observatory Workshop (LOWS) held at the TUFS campus, Fuchu, Tokyo


2004/12/23: We received the first serious claim from an American online journal publisher
2004/11/xx: GII servers' memory and HDDs expanded
2004/10/23: A big earthquake hit our campus and the Nagaoka region
2004/10/13: A delegation of the Language Observatory (Y. Mikami & V. Narayanan) visited Malaysian MIMOS
2004/07/09-23: Test crawling on OIC, India and Indochina successfully collected 25,762,053 unique pages
2004/07/xx: Dr. Zavarsky visited the USA for discussions with Internet Archive, Basis Technology and others
2004/06/xx: Three Milano University team members visited NUT and brought us the powerful and reliable "UbiCrawler"
2004/04/XX: Dr. Zavarsky visited Milano to discuss collaboration with Università degli Studi di Milano (USM)
2004/04/15: Twenty servers were installed at the observatory
2004/03/XX: The Language Observatory project received an official support letter from UNESCO.
2004/02/20-21: The First Language Observatory Workshop (FLOWS 2004) was held in Nagaoka, Japan, with the attendance of a UNESCO representative. The observatory officially commenced operation. (Media coverage on FLOWS)
2004/02/17: The Japanese National Commission for UNESCO committed official support to FLOWS 2004.


2003/10/01: Official LOP website was created.
2003/09/18: The Language Observatory project was selected by the Japan Science and Technology Agency (JST) for its RISTEX program.
2003/06/xx: The Language Observatory project was proposed to the Japan Science and Technology Agency (JST)
... see more for LOP pre-history

14:53:07 - Mikami


Language Observatory pre-history: SEARCC/SRIG-MLC

2002 August: A path-breaking language identification technique, "Shift-Codon-Matching", published in the ACM/TALIP journal
2002 XXXX: The third version of the website was designed by I. Suzuki
2002 March: A preliminary Cyber Census Experiment initiated by Y. Mikami, I. Suzuki, Y. Chubachi, V. Narayanan & D. Rao
2002/02/05: Web Page Distribution by Language and by Domain: East/South Asia -- Estimates Searched by Google --, by Y. Mikami
2001 November: The Cyber Census project was first openly discussed at the SEARCC/SRIG-MLC meeting, Auckland, New Zealand
2001 March: The second version of the Multilingual Computing Resource website launched
2000/11/28: SEARCC/MLC Terms of Reference established during the second face-to-face meeting, Manila, Philippines
2000/11/13: Version 1.0 was published on the net
2000 May: The first version of the Multilingual Computing Resource website was circulated among SRIG-MLC members
1999 December: The first SRIG-MLC meeting was held at SEARCC1999, Singapore.
1998/XX/XX: The initial idea of SRIG-MLC was proposed by Prof. Ohiwa at a SEARCC/EXCO meeting

For more recent events of the Language Observatory Project (LOP), see the LOP Official Logbook

16:18:00 - Mikami