Language Observatory
2005-11-27
What is "Language Observatory"?
An astronomical observatory surveys the stars in the universe. Likewise, the Language Observatory surveys language activity in the virtual universe of the Internet. Just as the former catches the faintest light from stars, the Language Observatory tries to catch the subtle messages of less widely spoken languages, as far as they appear on the Internet, and to answer questions such as:
- How many languages are found on the Internet?
- How many web pages are written in a given language/script under a specific country code top-level domain (ccTLD)?
- What kinds of character encoding schemes (CESs) are employed to encode a given language?
- How quickly is UCS/Unicode spreading?
- To what extent are open-source software (OSS) technologies employed by a specific language community?
- How is a specific language community linked with other language communities? (web-graph analysis)
---------------------------------------------------------
"In the galaxy of languages, every word is a star."... UNESCO
---------------------------------------------------------
To read more "about us":
1. Objectives of the Language Observatory
2. Who runs the Language Observatory?
3. How does the Language Observatory actually work?
4. What kind of reports will the Language Observatory produce?
5. Language Observatory Project (LOP) Milestones
6. Language Observatory Staff
7. International Advisors
8. Logo Story
20:24:06 -
Mikami -
Objectives of the Language Observatory
Although over 6,000 languages are currently spoken around the globe (see footnote [1]), only a few of them have been properly represented in the virtual universe of the Internet. We call this situation the "Digital Divide among Languages" or just the "Digital Language Divide" (see another blog article). We share UNESCO's concern on this point, as it stresses the importance of "the preservation of a balanced use of languages in cyberspace". The objectives of the Language Observatory Project can be stated as follows:
- To raise public awareness of "Digital Language Divide" issues.
- To encourage support for the processing of those languages now falling through the net.
footnote
[1] Although experts estimate that there are more than 6,000 languages on the globe, the project will try to cover the 300+ languages that appear in the Alphabetical listing of all translations of the Universal Declaration of Human Rights (UDHR). This coverage is still far beyond the scope of currently available commercial search engines.
20:23:51 -
Mikami -
Who runs the Language Observatory?
The project is currently funded by the Japan Science and Technology Agency (JST) under the RISTEX program, and is implemented by a partnership of several institutions:
- Nagaoka University of Technology (NUT), Japan
- Keio University, Japan
- Tokyo University of Foreign Studies (TUFS), Japan
- Fakulti Sains Komputer & Sistem Maklumat (FSKSM), Universiti Teknologi Malaysia (UTM), Malaysia
- Thai Computational Linguistic Laboratory (TCL), Thailand
- Miskolc University, Hungary
- Technology Development of Indian Languages (TDIL), Ministry of IT, India
- The Laboratory for Web Algorithmics (LAW), Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano (USM), Milano, Italy
20:23:34 -
Mikami -
How does the Language Observatory actually work?
The Language Observatory works through the following steps.
- Crawler robots visit pages on the Internet at least once a year and fetch their text content. The robots return to the same pages regularly so that periodic reports can be produced.
- The Language Identification Module (LIM) analyses the page content and identifies the language properties of the page (language, script, character encoding scheme, etc.). The LIM is trained with contributions from language experts.
- The Observatory counts the number of pages according to their language properties and compiles a regular report.
- The Observatory also analyses HTML-tag information and link information to reveal open-source usage status, web graph structure, etc.
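The counting step can be illustrated with a minimal sketch. It assumes that the identification step has already labelled each fetched page; the `PageRecord` type, the function names and the sample records are hypothetical and are not taken from the Observatory's actual software.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRecord:
    """Hypothetical output of the identification step for one fetched page."""
    url: str
    language: str
    script: str
    ces: str      # character encoding scheme
    n_bytes: int  # size of the fetched text content


def census(records):
    """Tally page counts and byte counts per (language, script, CES) triple."""
    pages = Counter()
    size = Counter()
    for r in records:
        key = (r.language, r.script, r.ces)
        pages[key] += 1
        size[key] += r.n_bytes
    return pages, size


def share_by_language(pages):
    """Relative share of pages per language, as fractions that sum to 1."""
    per_language = Counter()
    for (language, _script, _ces), n in pages.items():
        per_language[language] += n
    total = sum(per_language.values())
    if total == 0:
        return {}
    return {lang: n / total for lang, n in per_language.items()}


# Made-up sample records, for illustration only.
sample = [
    PageRecord("http://example.org/a", "Thai", "Thai", "TIS-620", 2048),
    PageRecord("http://example.org/b", "Thai", "Thai", "UTF-8", 4096),
    PageRecord("http://example.org/c", "Malay", "Latin", "ISO-8859-1", 1024),
    PageRecord("http://example.org/d", "Malay", "Latin", "ISO-8859-1", 1024),
]
pages, size = census(sample)
shares = share_by_language(pages)
```

Keying the tallies on the full triple keeps the per-script and per-CES breakdowns recoverable from the same counters; the per-language share is then a simple roll-up.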
20:23:18 -
Mikami -
What kind of reports will be produced?
The observatory regularly publishes the following reports.
1. The Cyber Census Report:
This report reveals the activity level of all the languages observable on the web pages of the Internet. It contains statistics on:
--- number of pages and bytes by language
--- number of pages and bytes by script
--- number of pages and bytes by character encoding scheme (CES)
--- relative share in the cyberspace by language
The report, we believe, will reveal the actual usage of languages in the virtual universe and will help policy makers and international organizations to understand the real picture of the unbalanced usage of languages in cyberspace.
2. The CES Report:
The report describes which character encoding schemes are employed to encode each language and script. Although most pages on the Internet are encoded in widely known CESs, such as ASCII or the ISO/IEC 8859 series for Latin script, ASMO 449 for Arabic script, and JIS or ISO-2022-JP for Japanese script, some pages are encoded with locally developed "fonts" that represent a local language. Such a locally developed font is not just a font but a kind of implicit character encoding scheme, so we call it an "Implicit CES". Even after the emergence of the multi-octet character set ISO/IEC 10646, which covers almost all the scripts currently used in the world, many Implicit CESs are still in use, either because OpenType technologies are unavailable or simply for convenience. The result is chaos in CES usage.
The report, we believe, will help ICT engineers and policy makers to understand the technical problems behind the "Digital Language Divide". Technical specifications of the observed Implicit CESs will be provided through the report and will be used to develop converters.
3. The Corpus Statistics Report:
The report gives various statistics on text data. The statistics include:
--- Single byte/character distribution for a given CES text
--- Two-byte/character distribution for a given CES text
--- Three-byte/character distribution for a given CES text
--- Single word distribution for a given language/script text
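As an illustration of what these distributions look like, the following is a minimal sketch that computes the relative frequency of overlapping 1-, 2- and 3-byte sequences in a byte string, plus a simple word distribution. The function names are hypothetical, and splitting on whitespace is a simplifying assumption: scripts written without spaces between words would need a proper word segmenter.

```python
from collections import Counter


def ngram_distribution(data: bytes, n: int) -> dict:
    """Relative frequency of each n-byte sequence, using overlapping windows."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: c / total for gram, c in counts.items()}


def word_distribution(text: str) -> dict:
    """Relative frequency of whitespace-separated words (a simplification)."""
    counts = Counter(text.split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: c / total for word, c in counts.items()}


sample = b"abracadabra"
unigrams = ngram_distribution(sample, 1)
bigrams = ngram_distribution(sample, 2)
trigrams = ngram_distribution(sample, 3)
words = word_distribution("to be or not to be")
```

To obtain character rather than byte distributions, the same windowing can be applied to a decoded `str` once the CES of the text is known.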
20:13:43 -
Mikami -
2005-09-01
Language Observatory Staff
| Project Management | |
| Director | Yoshiki Mikami |
| Co-Director (system) | Pavol Zavarsky |
| Co-Director (linguistic) | Makoto Minegishi (TUFS) |
| Co-Director (linguistic) | Kazuhiko Machida (TUFS) |
| Coordinator | Tomoe Takahashi |
| Coordinator | Jun Sugawara (TUFS/JST) |
| International Liaison | V. Narayanan (Excel Solutions, Singapore) |
| | S. T. Nandasara (Univ. of Colombo, Sri Lanka) |
| LI Manager | János Göndri Nagy (NUT/Miskolc Univ., Hungary) |
| Crawl Manager | Chew Yew Choong |
| SeedURL Manager | Tomoe Takahashi |
| System Administrator | Chew Yew Choong |
| | Junichiro Chisuga |
| Graphic Design | Carla Salem (Nagaoka Institute of Design) |
| | |
| Server/Network System | |
| System Design | Yoshihide Chubachi (NUT/JST, Keio Univ.) |
| Network Design | Katsuko T. Nakahira |
| | Masayuki Takahashi |
| | Tomohide Maki |
| | Chew Yew Choong |
| | Keisuke Koda |
| | |
| Language Identification | |
| System Design | Yoshihide Chubachi (NUT/JST, Keio Univ.) |
| Algorithm Design | Izumi Suzuki |
| Researcher | János Göndri Nagy (NUT/Miskolc Univ., Hungary) |
| | Wunna Ko Ko |
| | Shota Wada |
| | Rizza Caminero |
| | Keisuke Koda |
| | |
| Data Analysis | |
| Contents Analysis | Mohd Zaidi abd Rozan (NUT/UTM) |
| | Masayuki Takahashi |
| Network Analysis | Katsuko T. Nakahira |
| | Tetsuya Hoshino |
| Language Analysis | János Göndri Nagy (NUT/Miskolc Univ., Hungary) |
| | Robin Nagano Lee (Miskolc Univ., Hungary) |
| Charset Analysis | János Göndri Nagy (NUT/Miskolc Univ., Hungary) |
| | Ryosuke Nakao |
| Web Graph Analysis | Shota Wada |
| | Rizza Caminero |
| | Naoyuki Ishihara |
| | |
| Language Gallery | |
| Researcher | Teruaki Nagasawa |
| | Keiko Kitade |
| | Ryosuke Nakao |
| | Taichi Sotoya |
| | Jun Yokoyama |
NOTE: Unless otherwise noted in parentheses, staff are faculty members and students of Nagaoka University of Technology (NUT).
23:36:00 -
Mikami -
2005-08-16
Logo Story
A calligraphic designer from Lebanon, Ms. Carla, happened to come to the GII laboratory in May 2005. Matsuda-sensei introduced her to me because she found some links between Ms. Carla's interests and the research topics of our laboratory, especially the Language Observatory project. I must thank Matsuda-sensei for that.
Lebanon is the Phoenicia of old, and Byblos, the birthplace of the original alphabet, lies in the north of the country; I have long wanted to visit it someday. From our first meeting, I found that Matsuda-sensei's insight was quite right, and we discovered many common interests. I also asked Ms. Carla to design a logotype for the project, and in the end she made one for us. The following is the message from the designer.
So she came to our laboratory
The wind is constantly changing its form and direction. Languages are quite similar. Some evolve while others face extinction. Constantly being influenced, recycled, or personalized. The windmill is an indicator of the wind's direction and nature. It monitors and gives evidence. If you are watching, then you are the observers. As long as languages exist, and are in constant change, they should be monitored from a focal point, at a fixed time and place. This is what the Language Observatory's main concern is. The logotype was born from this concept.
From the shores of Byblos, and the first letter of the alphabet, to the last letter of the Greek alphabet, the logo is attempting to speak for itself. Is it the letter 'o' or is it 'ω'? Maybe it is a spiral oscillation that leads you from one to the other. The answer is not one. YOU hold one; that is, your own language and interpretation.
00:13:00 -
Mikami -
2005-05-01
Major Events & Milestones
2005
| DATE | EVENT |
| 2005/11/19 | African Web Survey Project discussed and agreed upon among the Academy of African Languages (ACALAN), Linguasphere Observatory and Language Observatory |
| 2005/11/16-18 | Language Observatory attended the WSIS Tunis Phase |
| 2005/11/17 | MOU signed with Linguasphere Observatory, UK |
| 2005/10/20 | First version of the Language Identifier Module released |
| 2005/05/12 | Language Observatory and Milano University team presented a joint poster at WWW2005, Chiba. MOU signed with Milano University |
| 2005/05/10 | WSIS UNESCO Thematic Meeting on Multilingualism at Bamako, Mali. Prof. Zaki of UTM attended on behalf of the Language Observatory |
| 2005/04/28 | Prof. K. Machida delivered a lecture on grammatological informatics |
| 2005/04/19 | Dr. M. Iwahashi's lab members visited LOP |
| 2005/04/15 | Universiti Teknologi Malaysia (UTM) assigned as the first Regional Language Observatory, in charge of Malaysia and OIC countries |
| 2005/04/01 | Dr. Chubachi joined NUT/LOP team |
| 2005/03/28 | Mr. Göndri Nagy János from Miskolc University joined NUT/LOP team |
| 2005/02/24-25 | LOWS Technical Meeting at Oginoshima in deep snow |
| 2005/02/21-23 | The Second Language Observatory Workshop (LOWS) held at TUFS campus, Fuchu, Tokyo |
2004
| DATE | EVENT |
| 2004/12/23 | We received the first serious claim from an American online journal publisher |
| 2004/11/xx | GII servers' memory and HDDs upgraded |
| 2004/10/23 | Big earthquake hit our campus and Nagaoka region |
| 2004/10/13 | A delegation of Language Observatory (Y. Mikami & V. Narayanan) visited Malaysian MIMOS |
| 2004/07/09-23 | Test crawling on OIC, India and Indochina successfully collected 25,762,053 unique pages |
| 2004/07/xx | Dr. Zavarsky visited USA to discuss with Internet Archive, Basis Technology and others |
| 2004/06/xx | Three Milano University team members visited NUT and brought us powerful and reliable "UbiCrawler" |
| 2004/04/XX | Dr. Zavarsky visited Milano to discuss collaboration with Università degli Studi di Milano (USM) |
| 2004/04/15 | Twenty servers were installed at the observatory |
| 2004/03/XX | The Language Observatory project received an official support letter from UNESCO. |
| 2004/02/20-21 | The First Language Observatory Workshop (FLOWS 2004) was held in Nagaoka, Japan with attendance of UNESCO representative. The observatory officially commenced operation. (Media coverage on FLOWS) |
| 2004/02/17 | Japanese National Commission for UNESCO committed official support to FLOWS 2004. |
2003
| DATE | EVENT |
| 2003/10/01 | Official LOP website was created. |
| 2003/09/18 | The Language Observatory project was selected by the Japan Science and Technology Agency (JST) for its RISTEX program. |
| 2003/06/xx | The Language Observatory project was proposed to the Japan Science and Technology Agency (JST) |
... for more, see LOP pre-history
14:53:07 -
Mikami -
2004-01-10
Language Observatory pre-history::: SEARCC/SRIG-MLC
| DATE | EVENT |
|---|---|
| 2002 August | A path-breaking language identification technique, "Shift-Codon-Matching", published in the ACM/TALIP journal |
| 2002 XXXX | The third version of the website was designed by I. Suzuki |
| 2002 March | A preliminary Cyber Census Experiment initiated by Y. Mikami, I. Suzuki, Y. Chubachi, V. Narayanan & D. Rao |
| 2002/02/05 | Web Page Distribution by Language and by Domain: East/South Asia -- Estimates Searched by Google --, by Y. Mikami |
| 2001 November | The Cyber Census project was first openly discussed at the SEARCC/SRIG-MLC meeting, Auckland, New Zealand |
| 2001 March | The second version of the Multilingual Computing Resource website launched |
| 2000/11/28 | SEARCC/MLC Terms of Reference established during the second face-to-face meeting, Manila, Philippines |
| 2000/11/13 | Version 1.0 was published on the net |
| 2000 May | The first version of the Multilingual Computing Resource website was circulated among SRIG-MLC members |
| 1999 December | The first SRIG-MLC meeting was held at SEARCC1999, Singapore. |
| 1998/XX/XX | The initial idea of SRIG-MLC was proposed by Prof. Ohiwa at the SEARCC/EXCO meeting |
For more recent events of the Language Observatory Project (LOP), see the LOP Official Logbook
16:18:00 -
Mikami -