Language Observatory

2010-02-18

G2LI - Language Identification for Web page and Text file

G2LI (Global Information Infrastructure Laboratory's Language Identifier) has been released to the public. A simple web interface to G2LI has been published on GII's Web Application Server. Users can use the service to identify the Language, Script and Encoding system (LSE) of a Web document. The system can process text and HTML documents.

For language identification, enter the URL or upload a local txt/html file to the G2LI Web Application.

 

00:22:27 - ycchew

2009-11-06

Caribbean Web Domains Survey

Today we started the web crawl for Caribbean web domains. The crawl had been started a few times before but was always interrupted by the campus refurbishment works. We hope that this round can finally run without those interruptions.

 

Details of the web crawl

carribean09.seed file

carribean09.cctld file

Crawling settings:

maxdepth=16

maxurlsperhost=10000

Crawling process monitoring

Contact e-mail : s 0 7 7 0 0 3 @ i c s . n a g a o k a u t . a c . j p

For webmasters: to stop LOP's crawling

UbiCrawler supports the Robot Exclusion Standard. If you want to exclude your site from being crawled by UbiCrawler, see The Web Robots Pages.

Briefly, you can put this robots.txt file at the root of the web server you want to exclude from the crawling.
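The robots.txt file linked above is not reproduced in this post. As a rough sketch only, the following Python snippet shows what such an exclusion rule could look like and how it would be interpreted. The user-agent token "UbiCrawler" and the example URLs are assumptions for illustration, and the check uses Python's standard urllib.robotparser rather than UbiCrawler's own parser.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: excludes only the (assumed) token "UbiCrawler"
# from the whole site and leaves every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: UbiCrawler
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder host, not a crawled site.
    print(is_allowed(ROBOTS_TXT, "UbiCrawler", "http://example.com/index.html"))
    print(is_allowed(ROBOTS_TXT, "OtherBot", "http://example.com/index.html"))
```

Replacing "Disallow: /" with a narrower path would exclude only part of the site.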

15:53:37 - ycchew

2009-01-01

Asia-NoCJK web crawling

Crawl experiment: Asia-NoCJK web crawling

ccTLD domains

ae
af
az
bd
bh
bn
bt
cy
id
il
in
iq
ir
jo
kg
kh
kw
kz
la
lb
lk
mm
mn
mv
my
np
om
pg
ph
pk
ps
qa
sa
sg
sy
th
tj
tm
tp
tr
uz
vn
ye
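As an illustration of how a ccTLD list like the one above can scope a crawl, here is a minimal Python sketch that keeps only URLs whose host ends in one of the listed ccTLDs. The function name in_scope and the example URLs are hypothetical; this is not UbiCrawler's actual filtering code.

```python
from urllib.parse import urlsplit

# The 43 ccTLDs listed above for the Asia-NoCJK crawl.
ASIA_NOCJK_CCTLDS = frozenset(
    "ae af az bd bh bn bt cy id il in iq ir jo kg kh kw kz la lb lk mm mn mv "
    "my np om pg ph pk ps qa sa sg sy th tj tm tp tr uz vn ye".split()
)


def in_scope(url: str, cctlds: frozenset = ASIA_NOCJK_CCTLDS) -> bool:
    """Return True if the URL's host name ends in one of the given ccTLDs."""
    host = urlsplit(url).hostname  # already lower-cased by urlsplit
    if not host:
        return False
    # The last dot-separated label of the host name is its top-level domain.
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return tld in cctlds


if __name__ == "__main__":
    for url in ("http://www.example.co.th/", "http://example.jp/"):
        print(url, in_scope(url))
```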


System locks : 22-40

Max depth : 8

Max URLs per host : 1,000

URL delay : 10,000 ms
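To make the effect of these limits concrete, the sketch below shows one way a crawl frontier could apply a depth limit and a per-host URL cap, and how long a single host would take to exhaust at the configured delay. It assumes that depth means link hops from a seed and that the delay applies between successive requests to the same host; neither is stated in this post, and the class and function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_DEPTH = 8              # "Max depth" from the settings above
MAX_URLS_PER_HOST = 1000   # "Max URLs per host"
URL_DELAY_MS = 10_000      # "URL delay"


class CrawlLimits:
    """Decides whether a discovered URL may still be queued for fetching."""

    def __init__(self, max_depth=MAX_DEPTH, max_urls_per_host=MAX_URLS_PER_HOST):
        self.max_depth = max_depth
        self.max_urls_per_host = max_urls_per_host
        self.accepted_per_host = Counter()

    def accept(self, url: str, depth: int) -> bool:
        host = urlsplit(url).hostname
        if host is None or depth > self.max_depth:
            return False
        if self.accepted_per_host[host] >= self.max_urls_per_host:
            return False
        self.accepted_per_host[host] += 1
        return True


def hours_to_exhaust_host(max_urls=MAX_URLS_PER_HOST, delay_ms=URL_DELAY_MS) -> float:
    """Time needed to fetch max_urls pages from one host at the given delay."""
    return max_urls * delay_ms / 1000 / 3600


if __name__ == "__main__":
    print(f"{hours_to_exhaust_host():.2f} hours per fully crawled host")
```

Under these assumptions, a host that reaches the 1,000-URL cap occupies the crawler for roughly 2.8 hours of wall-clock time.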

Crawler name : UbiCrawler/v0.4beta (http://gii.nagaokaut.ac.jp/~ubi/)

Contact e-mail : s077003@ics . n a g a o k a u t . a c . j p

For webmasters: to stop LOP's crawling

UbiCrawler supports the Robot Exclusion Standard. If you want to exclude your site from being crawled by UbiCrawler, see The Web Robots Pages.

Briefly, you can put this robots.txt file at the root of the web server you want to exclude from the crawling.

To monitor network traffic : http://gii2.nagaokaut.ac.jp/~ycchew/php/phpViewRrdGraph.php?rrdgraph=netTraffic&duration=day&btnSubmit=Submit

The general status of the Asia-NoCJK crawl can be viewed here.

11:45:37 - ycchew

2008-12-12

Web crawling on online Thai newspaper

Today, LOP started a crawl experiment for the following Thai newspaper web sites:

http://www.bangkokbiznews.com/
http://www.matichon.co.th/khaosod/
http://www.komchadluek.com/
http://www.thannews.th.com/
http://www.dailynews.co.th/web/html/home/
http://www.thairath.co.th/
http://www.matichon.co.th/prachachat/prachachat.php
http://www.manager.co.th/
http://www.matichon.co.th/matichon/
http://www.siamsport.co.th/home.html
http://www.siamturakij.com/

The web crawler used is Heritrix, with the following settings:

number of crawling agents: 1

maximum crawling time: 1 week

max-hops: 20

max-path-depth: 20

user agent: Mozilla/5.0 (compatible; heritrix/1.14.2 +http://gii.nagaokaut.ac.jp)

crawler admin email: yewchoong @ yahoo . com

17:14:52 - ycchew

2008-09-03

Rosetta Project

The Rosetta Stone, found in 1799 during Napoleon's campaign in Egypt, was the key to deciphering the ancient Egyptian scripts. It is dated 196 B.C., and a single passage is carved on the stone in three kinds of writing: Greek, Demotic and Hieroglyphic. Comparing the three versions helped scholars understand the writing system and structure of the Egyptian scripts. Today "Rosetta stone" is used idiomatically for the key to decrypting or translating a difficult problem.

Now we have another Rosetta stone, the Rosetta Project, which aims to archive all the human languages spoken on earth and to build a publicly accessible digital library of human languages. It now serves nearly 100,000 pages of material documenting over 2,500 languages. Its main concern is the drastic and accelerating loss of the world's languages, which is also one of LOP's main concerns.

 You can browse and search the resources of this project at the URI below.

 http://www.rosettaproject.org/

21:31:52 - kodama

2007-11-27

Poster session at 135th LSJ conference

Last Sunday, 25th Nov., Chew-san and I went to Matsumoto to present a poster at the 135th conference of the Linguistic Society of Japan, the largest society in the field of linguistics in Japan.

Our session concerned language identification and the status of languages on the Asian and African web. Unfortunately, our session was assigned to 11:30-13:10, which is lunchtime, so there were only twenty visitors. But they were interested in our research and asked thought-provoking questions:

-Could our research be applied to automatic identification of spoken language?

-How could our identification engine be used in engineering fields?

-What is the situation in India? Is Hindi widely used or not? (from a linguist of Indic languages)

-How has the use of languages changed over time, and how could that be analyzed from a sociolinguistic viewpoint?

The linguist of Indic languages (the same person as above) told us that in India, those who can access the internet have received higher education and are skilled in English, and they tend to use English in social communication. I guess that the same situation exists in most Asian and African countries, but this speculation still needs to be supported by data. She also said that she found it very interesting that Bhojpuri, regarded as a dialect of Hindi, appeared in our survey.

A summary of our presentation is available at the following address:

http://wwwsoc.nii.ac.jp/lsj2/meetings/135/abstract/poster104.shtml
15:53:56 - kodama

2007-10-23

LOP Zemi Info!

VENUE: Meeting Room No 602, 6th Floor, Synthetic Research Building

TIME: Usually Every Tuesday, 2:40pm (New Time, Effective from 22nd May)

2007, September 11th, Tuesday @ 4:30-6:30pm
  • Presenter(s): Mikami Sensei; New Lab Members; Pann Yu Mon; Ali Hamed
  • Topics: First Seminar & Meet Together; Self Introduction; Myanmar Language Crawler; Public-Private Partnership Model
  • Note: If time permits, we'll try to have more presenters
2007, October 09th, Tuesday @ 4:30-6:30pm
  • Presenter(s): Zin Maung Maung; Takahashi Tomoe; Mikami Sensei
  • Topics: Myanmar Syllable Breaking; Terminology Dictionary; NEW PROJECT BRIEFING! Country Domain Vulnerability
2007, October 23rd, Tuesday @ 3:00-4:30pm
  • Presenter(s): Chew Yew Choong; Mikami Sensei
2007, November 02nd, Friday @ 1:00-2:30pm
  • Presenter(s): Pann Yu Mon
  • Topics: Myanmar Language Crawler
2007, November 09th, Friday @ 1:00-2:30pm
  • Presenter(s): Suzuki Sensei
  • Topics: Theoretical Background on LI
2007, November 13th, Tuesday @ 3:00-4:30pm
  • Presenter(s): Umarjan Osman
  • Topics: Uyghur Language
  • Note: Zemi Chair: Ashu Sensei
2007, November 20th, Tuesday @ 3:00-4:30pm
  • Presenter(s): Mr Kobayashi Tatsuo, ISO IEC JTC1/SC2; Umarjan Osman
  • Topics: To be informed; Uyghur Language

LOP SEMINAR MEMBERS

  1. Mikami Sensei
  2. Kodama Sensei
  3. Ashu Sensei
  4. Suzuki Sensei
  5. Chew Yew Choong [D1]
  6. Ishihara Naoyuki [M2]
  7. Koda Keisuke [M2]
  8. Hoshino Tetsuya [M2]
  9. Matsui Masashi [M2]
  10. Umarjan Usman [M1]
  11. Pann Yu Mon [M2-Sept]
  12. Zin Maung Maung [M2-Sept]
  13. Noel [M1-Sept]
  14. Tharu [D1-Sept]
  15. Mohd Zaidi [D3-Dec]
17:58:00 - zaidi

2007-07-23

SPRING 2007 LOP SEMINAR SCHEDULE

VENUE: Meeting Room No 602, 6th Floor, Synthetic Research Building

TIME: Usually Every Tuesday, 2:40pm (New Time, Effective from 22nd May)

2007, July 24th, Tuesday @ 2:40pm
  • Presenter(s): Prof Jay. R. Rajasekera, International University of Japan (IUJ)
2007, July 17th, Tuesday @ 2:40pm
  • Presenter(s): Ho Viet Nga; Kodama Sensei
  • Topics: Masters Post-presentation; Linguist view on Language Classification
2007, July 10th, Tuesday @ 1:30-5:30pm
  • Presenter(s): Chew-san; Kouda-san; Rizza-san; Arai-san; Prof Machida & Takashima Sensei from TUFS
  • Topics: GII2LI : Language Identifier, take two; Problems and Solutions of Identification of multibyte UTF-8; Link Structure of Web Comm. in SEA web; Language identification method with crawling specified language; TUFS presentation
  • Note: Special Guest & LOP member
2007, July 9th, Monday @ 3:00-5:00pm
  • Presenter(s): Kawamura Sensei from Tokyo International University
  • Topics: Presentation on Reading Tutor
  • Note: Special Guest
2007, July 03rd, Tuesday @ 2:30pm
  • Presenter(s): Kodama Sensei
  • Topics: Tutorial on AAA+ System
  • Note: Please download "AAA+" and install it before this tutorial!
2007, June 26th, Tuesday @ 2:40pm
  • Presenter(s): Koda Keisuke; Zin Maung Maung
  • Topics: Presentation of Updated Research Topic; Finite State Automaton
2007, June 19th, Tuesday @ 2:40pm
  • Presenter(s): Takahashi Tomoe; Pann Yu Mon
  • Topics: Engineering Terminology Project; Presentation of Updated Research Topic
2007, June 12th, Tuesday @ 2:40pm
  • Note: No LOP Zemi today! Due to the measles (麻疹) epidemic, NUT will be closed from 12th until 15th June 2007
2007, June 6th, Wednesday @ 2:00pm
  • Presenter(s): Robin Sensei
  • Topics: Tutorial on Writing in English
  • Note: This Special Tutorial starts at 2:00pm
2007, June 5th, Tuesday @ 2:40pm
  • Presenter(s): Zin Maung Maung; Pann Yu Mon
  • Topics: Word Breaking of Myanmar Text
2007, May 29th, Tuesday @ 2:40pm
  • Presenter(s): Mr. Seck Thiam Ho; Koda Keisuke
  • Topics: Good Programming Practice; Generating N-gram charts (N=1,2,3)
2007, May 22nd, Tuesday @ 2:40pm
  • Presenter(s): Mikami Sensei
  • Topics: World Language Year 2008; Second Life; ACCU Project; Multilingual Dictionary; IJCNLP 2008
  • Note: New Time 2:40pm effective from today
2007, May 15th, Tuesday @ 4:30pm
  • Presenter(s): Mikami Sensei; Ho Viet Nga
  • Topics: Language distance; Potential Vietnamese workforce in IT
2007, May 8th, Tuesday @ 4:30pm
  • Presenter(s): Rizza Caminero
  • Topics: Language Subgraph of South East Asian Web
2007, May 1st, Tuesday @ 4:30pm
  • Note: No Zemi Today!
2007, April 24th, Tuesday @ 4:30pm
  • Presenter(s): Pann Yu Mon; Zin Maung Maung
  • Topics: Presentation of Proposed Research Topic
2007, April 17th, Tuesday @ 4:30pm
  • Presenter(s): Mikami Sensei; Chew Yew Choong
  • Topics: Misinterpretation of LIM results and its possible reasons; Overall GII management, project implementation and suggestions to support members activities through collaborative systems
2007, April 06th, Friday @ 2:00pm
  • Presenter(s): Mikami Sensei
  • Topics: First Seminar & Meet Together

LOP SEMINAR MEMBERS

  1. Mikami Sensei
  2. Kodama Sensei
  3. Ashu Sensei
  4. Suzuki Sensei
  5. Chew Yew Choong [D1]
  6. Ho Viet Nga [M2-Sept]
  7. Rizza Caminero [M2-Sept]
  8. Ishihara Naoyuki [M2]
  9. Koda Keisuke [M2]
  10. Hoshino Tetsuya [M2]
  11. Matsui Masashi [M2]
  12. Umarjan Usman [M1]
  13. Pann Yu Mon [M1-Sept]
  14. Zin Maung Maung [M1-Sept]
  15. Mohd Zaidi [D3-Dec]

 

18:23:00 - zaidi

2007-07-05

Installing Language Identification Module (LIM) web application in Windows

The LIM web application runs inside the Apache Tomcat Servlet/JSP container. The following are the requirements for LIM on the Windows platform.

- Java 2 Standard Edition Runtime Environment (JRE) version 5.0 or later.
- Apache Tomcat 5.0 or above
- Apache Ant 1.6.5 or above

=============================
Installing LIM for Windows
=============================

Unpack LIM.zip to c:\lim

=============================
Running With JDK 5.0 Or Later
=============================

p/s: Some parts of the following guide are copied from Tomcat's installation guide.

As Java is the core engine, make sure it is installed.

Install the JDK according to the instructions included with the release.

Set an environment variable named JAVA_HOME to the pathname of the directory into which you installed the JDK, e.g. c:\j2sdk5.0 or /usr/local/java/j2sdk5.0.

=============================
Download and Install Apache Ant
=============================

Download the latest Ant binary distribution from http://ant.apache.org/
Unpack it and then set an environment variable named ANT_HOME to the directory where Ant was installed, e.g. C:\apache-ant-1.7.0 or /usr/local/apache-ant-1.7.0

=============================
Download and Install the Tomcat Binary Distribution
=============================

Download a binary distribution of Tomcat from http://tomcat.apache.org

Unpack the binary distribution into a convenient location so that the distribution resides in its own directory (conventionally named "apache-tomcat-[version]").

Set an environment variable named CATALINA_HOME to the pathname of the directory into which you installed Tomcat, e.g. C:\apache-tomcat-6.0.13 or /usr/local/apache-tomcat-6.0.13.
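Before going further, it may help to confirm that the three environment variables described so far (JAVA_HOME, ANT_HOME and CATALINA_HOME) are set and point at existing directories. The following Python script is only a convenience sketch and is not part of the LIM distribution; the function name check_lim_environment is hypothetical.

```python
import os

# Environment variables this guide asks you to set.
REQUIRED_VARS = ("JAVA_HOME", "ANT_HOME", "CATALINA_HOME")


def check_lim_environment(env=None) -> dict:
    """Return {variable name: problem} for each variable that looks wrong."""
    env = os.environ if env is None else env
    problems = {}
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not os.path.isdir(value):
            problems[name] = f"not a directory: {value}"
    return problems


if __name__ == "__main__":
    problems = check_lim_environment()
    if not problems:
        print("JAVA_HOME, ANT_HOME and CATALINA_HOME look fine.")
    for name, problem in problems.items():
        print(f"{name}: {problem}")
```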

=============================
Install LIM web application to Tomcat
=============================

There are two groups of files to install.

(1) LIM depends on many third-party libraries. Copy the following libraries to Tomcat's library directory, e.g. copy c:\lim\lib\* %CATALINA_HOME%\lib\

List of libraries:
antlr.jar
asm-attrs.jar
asm.jar
cglib-2.1.jar
commons-beanutils.jar
commons-cli-1.0.jar
commons-collections-2.1.1.jar
commons-digester.jar
commons-fileupload.jar
commons-lang-2.1.jar
commons-logging.jar
commons-validator.jar
dom4j-1.6.jar
ehcache-1.1.jar
fastutil-4.4.1.jar
hibernate3.jar
hsqldb.jar
htmlparser.jar
jakarta-oro.jar
jdbc2_0-stdext.jar
jta.jar
log4j-1.2.9.jar
mg4j-0.9.1.jar
struts.jar
utilx-1.2.jar

(2) Copy the LIM web application, i.e. trainer.war, to Tomcat's webapps directory, e.g. copy c:\lim\trainer.war %CATALINA_HOME%\webapps\

=============================
Start LIM database
=============================

Make sure the JRE and Ant binaries are in the current search path. If not, use the following command to set it up: set PATH=%PATH%;%JAVA_HOME%\bin;%ANT_HOME%\bin

Start the database for LIM by typing the following at the command prompt:
cd c:\lim
db dbstart


=============================
Startup Tomcat
=============================

p/s: If you encounter problems starting the LIM web application in Tomcat, try raising its minimum memory pool (to at least 128 MB) and maximum memory pool (to at least 256 MB).

The following command will increase the memory:
set CATALINA_OPTS=-Xms128m -Xmx256m

Tomcat can then be started by executing the following command:
%CATALINA_HOME%\bin\startup.bat (for Windows)
or $CATALINA_HOME/bin/startup.sh (for Unix/Linux)

Now you can access LIM by pointing your browser to http://localhost:8080/trainer
16:32:27 - ycchew

2007-06-05

Tree of Languages Relationship and the Language Similarity Table

Amazing image! The Tree of Languages was created from a 3-gram table using the neighbor-joining algorithm of Saitou and Nei.

SOURCES: Corpus building for minority languages by Kevin P. Scannell
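To give a feel for the input to such a tree, here is a small Python sketch of the first step only: building character 3-gram frequency profiles and computing pairwise distances between them. The cosine distance used here is an assumption (the post does not say which measure was applied to the 3-gram table), and the neighbor-joining step itself is not shown; the resulting distance table is what a neighbor-joining implementation would take as input.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text: str) -> Counter:
    """Count character 3-grams after lower-casing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity of two profiles (0 = identical direction)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty profile is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def distance_table(samples: dict) -> dict:
    """Pairwise distances between named text samples, keyed by name pairs."""
    profiles = {name: trigram_profile(text) for name, text in samples.items()}
    names = sorted(profiles)
    return {
        (x, y): cosine_distance(profiles[x], profiles[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }
```

In practice each language would be represented by a sizeable text sample, since short strings give very sparse 3-gram profiles.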
12:45:53 - ycchew