Language Observatory

2010-02-18

G2LI - Language Identification for Web page and Text file

G2LI (Global Information Infrastructure Laboratory's Language Identifier) has been released to the public. A simple web interface to G2LI has been published on GII's Web Application Server. Users can use the service to identify the Language, Script and Encoding (LSE) of a Web document. The system can process plain-text and HTML documents.

For language identification, enter a URL or upload a local txt/html file in the G2LI Web Application.

 

00:22:27 - ycchew

No comments

2009-11-06

Caribbean Web Domains Survey

Today we started web crawling for the Caribbean Web Domains. The crawl had been started a few times before but was always interrupted by the campus refurbishment works. We hope this round can finally run without those interruptions.

 

Detail of web crawling

carribean09.seed file

carribean09.cctld file

Crawling settings:

maxdepth=16

maxurlsperhost=10000
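How two limits like these might interact can be sketched in a few lines. This is only an illustration, not UbiCrawler's actual scheduling code: the function name plan_crawl, the in-memory links map that stands in for fetched pages, and the reading of maxdepth as link distance from a seed are all assumptions.

```python
from collections import deque
from urllib.parse import urlsplit


def plan_crawl(seeds, links, maxdepth=16, maxurlsperhost=10000):
    """Breadth-first walk of a link graph under depth and per-host limits.

    `links` maps a URL to the URLs it links to; it stands in for real page
    fetching, which this sketch does not do.
    """
    per_host = {}                      # host -> number of URLs accepted so far
    seen = set()
    order = []
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, depth = queue.popleft()
        if url in seen or depth > maxdepth:
            continue                   # already visited, or too far from a seed
        host = urlsplit(url).hostname or ""
        if per_host.get(host, 0) >= maxurlsperhost:
            continue                   # this host's quota is used up
        seen.add(url)
        per_host[host] = per_host.get(host, 0) + 1
        order.append(url)
        for target in links.get(url, []):
            queue.append((target, depth + 1))
    return order
```

A larger maxdepth lets the crawl reach pages buried deeper in a site, while maxurlsperhost keeps one very large host from dominating the collection.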

Crawling process monitoring

Contact e-mail : s 0 7 7 0 0 3 @ i c s . n a g a o k a u t . a c . j p

For webmasters: to stop LOP's crawling:

UbiCrawler supports the Robot Exclusion Standard. If you want to exclude your site from being crawled by UbiCrawler, see The Web Robots Pages.

Briefly, you can put this robots.txt file at the root of the web server you want to exclude from the crawling.
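The robots.txt file linked from the original post is not reproduced here. As a hedged illustration only, a file that excludes just this crawler might look like the ROBOTS_TXT string below, and Python's standard urllib.robotparser can be used to check its effect before uploading it. The user-agent token "UbiCrawler" is an assumption based on the crawler's name; the helper is_allowed is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed file content: the token "UbiCrawler" is a guess at the name the
# crawler matches against, not something confirmed by the post.
ROBOTS_TXT = """\
User-agent: UbiCrawler
Disallow: /
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Under this file, a crawler identifying itself as UbiCrawler is refused every path, while other crawlers are unaffected.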

15:53:37 - ycchew

No comments

2009-01-01

Asia-NoCJK web crawling

Crawl experiment: Asia-NoCJK web crawling

CCTLD domain

ae
af
az
bd
bh
bn
bt
cy
id
il
in
iq
ir
jo
kg
kh
kw
kz
la
lb
lk
mm
mn
mv
my
np
om
pg
ph
pk
ps
qa
sa
sg
sy
th
tj
tm
tp
tr
uz
vn
ye
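The post does not say how the crawler applies this list. A minimal sketch, assuming the crawl is restricted to hosts whose last domain label is one of the listed ccTLDs, could look like this (the function in_scope and the shortened CCTLDS set are illustrative only):

```python
from urllib.parse import urlsplit

# A few of the ccTLDs listed above; the full list would be loaded the same way.
CCTLDS = {"ae", "af", "bd", "th", "vn"}


def in_scope(url, cctlds=CCTLDS):
    """Return True if the URL's host ends in one of the given ccTLDs."""
    host = urlsplit(url).hostname      # already lower-cased by urlsplit
    if not host:
        return False
    return host.rstrip(".").rsplit(".", 1)[-1] in cctlds
```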


System locks : 22-40

Max depth : 8

Max URLs per host : 1,000

URL delay : 10,000 ms

Crawler name : UbiCrawler/v0.4beta (http://gii.nagaokaut.ac.jp/~ubi/)

Contact e-mail : s077003@ics . n a g a o k a u t . a c . j p
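For a sense of scale, the URL delay and the per-host cap together put a floor on how long one host can take. The sketch below assumes the delay applies between consecutive requests to the same host, which the post does not state; min_hours_per_host is a hypothetical helper.

```python
def min_hours_per_host(max_urls_per_host, url_delay_ms):
    """Lower bound, in hours, on the time needed to use up one host's quota."""
    # n consecutive requests are separated by (n - 1) delays
    gaps = max(max_urls_per_host - 1, 0)
    return gaps * url_delay_ms / 1000 / 3600
```

With the settings above (1,000 URLs per host, 10,000 ms delay) this comes to roughly 2.8 hours per host, ignoring download time.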

For webmasters: to stop LOP's crawling:

UbiCrawler supports the Robot Exclusion Standard. If you want to exclude your site from being crawled by UbiCrawler, see The Web Robots Pages.

Briefly, you can put this robots.txt file at the root of the web server you want to exclude from the crawling.

To monitor network traffic : http://gii2.nagaokaut.ac.jp/~ycchew/php/phpViewRrdGraph.php?rrdgraph=netTraffic&duration=day&btnSubmit=Submit

The general status of the Asia-NoCJK crawl can be viewed here.

11:45:37 - ycchew

No comments

2008-12-12

Web crawling on online Thai newspaper

Today, LOP started a crawl experiment for the following Thai newspaper web sites:

http://www.bangkokbiznews.com/
http://www.matichon.co.th/khaosod/
http://www.komchadluek.com/
http://www.thannews.th.com/
http://www.dailynews.co.th/web/html/home/
http://www.thairath.co.th/
http://www.matichon.co.th/prachachat/prachachat.php
http://www.manager.co.th/
http://www.matichon.co.th/matichon/
http://www.siamsport.co.th/home.html
http://www.siamturakij.com/

The web crawler used is Heritrix, with the following settings:

number of crawling agents: 1

maximum crawling time: 1 week

max-hops: 20

max-path-depth: 20

user agent: Mozilla/5.0 (compatible; heritrix/1.14.2 +http://gii.nagaokaut.ac.jp)

crawler admin email: yewchoong @ yahoo . com

17:14:52 - ycchew

No comments

2007-10-23

LOP Zemi Info!

VENUE: Meeting Room No 602, 6th Floor, Synthetic Research Building

TIME: Usually Every Tuesday, 2:40pm (New Time, Effective from 22nd May)

2007, September 11th, Tuesday @ 4:30-6:30pm
Presenter(s):
  • Mikami Sensei
  • New Lab Members
  • Pann Yu Mon
  • Ali Hamed
Topics:
  • First Seminar & Meet Together
  • Self Introduction
  • Myanmar Language Crawler
  • Public-Private Partnership Model
Note: If time permits, we'll try to have more presenters

2007, October 09th, Tuesday @ 4:30-6:30pm
Presenter(s):
  • Zin Maung Maung
  • Takahashi Tomoe
  • Mikami Sensei
Topics:
  • Myanmar Syllable Breaking
  • Terminology Dictionary
  • NEW PROJECT BRIEFING! Country Domain Vulnerability

2007, October 23rd, Tuesday @ 3:00-4:30pm
Presenter(s):
  • Chew Yew Choong
  • Mikami Sensei

2007, November 02nd, Friday @ 1:00-2:30pm
Presenter(s):
  • Pann Yu Mon
Topics:
  • Myanmar Language Crawler

2007, November 09th, Friday @ 1:00-2:30pm
Presenter(s):
  • Suzuki Sensei
Topics:
  • Theoretical Background on LI

2007, November 13th, Tuesday @ 3:00-4:30pm
Presenter(s):
  • Umarjan Osman
Topics:
  • Uighur Language
Note: Zemi Chair: Ashu Sensei

2007, November 20th, Tuesday @ 3:00-4:30pm
Presenter(s):
  • Mr Kobayashi Tatsuo, ISO/IEC JTC1/SC2
  • Umarjan Osman
Topics:
  • To be informed
  • Uighur Language
       

LOP SEMINAR MEMBERS

  1. Mikami Sensei
  2. Kodama Sensei
  3. Ashu Sensei
  4. Suzuki Sensei
  5. Chew Yew Choong [D1]
  6. Ishihara Naoyuki [M2]
  7. Koda Keisuke [M2]
  8. Hoshino Tetsuya [M2]
  9. Matsui Masashi [M2]
  10. Umarjan Usman [M1]
  11. Pann Yu Mon [M2-Sept]
  12. Zin Maung Maung [M2-Sept]
  13. Noel [M1-Sept]
  14. Tharu [D1-Sept]
  15. Mohd Zaidi [D3-Dec]
17:58:00 - zaidi

No comments

2007-07-23

SPRING 2007 LOP SEMINAR SCHEDULE

VENUE: Meeting Room No 602, 6th Floor, Synthetic Research Building

TIME: Usually Every Tuesday, 2:40pm (New Time, Effective from 22nd May)

2007, July 24th, Tuesday @ 2:40pm
Presenter(s):
  • Prof Jay R. Rajasekera, International University of Japan (IUJ)

2007, July 17th, Tuesday @ 2:40pm
Presenter(s):
  • Ho Viet Nga
  • Kodama Sensei
Topics:
  • Masters Post-presentation
  • Linguist view on Language Classification

2007, July 10th, Tuesday @ 1:30-5:30pm
Presenter(s):
  • Chew-san
  • Kouda-san
  • Rizza-san
  • Arai-san
  • Prof Machida & Takashima Sensei from TUFS
Topics:
  • GII2LI : Language Identifier, take two
  • Problems and Solutions of Identification of multibyte UTF-8
  • Link Structure of Web Comm. in SEA web
  • Language identification method with crawling specified language
  • TUFS presentation
Note: Special Guest & LOP member

2007, July 9th, Monday @ 3:00-5:00pm
Presenter(s):
  • Kawamura Sensei from Tokyo International University
Topics:
  • Presentation on Reading Tutor
Note: Special Guest

2007, July 03rd, Tuesday @ 2:30pm
Presenter(s):
  • Kodama Sensei
Topics:
  • Tutorial on AAA+ System
Note: Please download "AAA+" and install it before this tutorial!

2007, June 26th, Tuesday @ 2:40pm
Presenter(s):
  • Koda Keisuke
  • Zin Maung Maung
Topics:
  • Presentation of Updated Research Topic
  • Finite State Automaton

2007, June 19th, Tuesday @ 2:40pm
Presenter(s):
  • Takahashi Tomoe
  • Pann Yu Mon
Topics:
  • Engineering Terminology Project
  • Presentation of Updated Research Topic

2007, June 12th, Tuesday @ 2:40pm
Note: Due to the measles (麻疹) epidemic, NUT will be closed from 12th until 15th June 2007. No LOP Zemi today!

2007, June 6th, Wednesday @ 2:00pm
Presenter(s):
  • Robin Sensei
Topics:
  • Tutorial on Writing in English
Note: This Special Tutorial starts at 2:00pm

2007, June 5th, Tuesday @ 2:40pm
Presenter(s):
  • Zin Maung Maung
  • Pann Yu Mon
Topics:
  • Word Breaking of Myanmar Text

2007, May 29th, Tuesday @ 2:40pm
Presenter(s):
  • Mr. Seck Thiam Ho
  • Koda Keisuke
Topics:
  • Good Programming Practice
  • Generating N-gram charts (N=1,2,3)

2007, May 22nd, Tuesday @ 2:40pm
Presenter(s):
  • Mikami Sensei
Topics:
  • World Language Year 2008
  • Second Life
  • ACCU Project
  • Multilingual Dictionary
  • IJCNLP 2008
Note: New Time 2:40pm effective from today

2007, May 15th, Tuesday @ 4:30pm
Presenter(s):
  • Mikami Sensei
  • Ho Viet Nga
Topics:
  • Language distance
  • Potential Vietnamese workforce in IT

2007, May 8th, Tuesday @ 4:30pm
Presenter(s):
  • Rizza Caminero
Topics:
  • Language Subgraph of South East Asian Web

2007, May 1st, Tuesday @ 4:30pm
Note: No Zemi Today!

2007, April 24th, Tuesday @ 4:30pm
Presenter(s):
  • Pann Yu Mon
  • Zin Maung Maung
Topics:
  • Presentation of Proposed Research Topic

2007, April 17th, Tuesday @ 4:30pm
Presenter(s):
  • Mikami Sensei
  • Chew Yew Choong
Topics:
  • Misinterpretation of LIM results and its possible reasons
  • Overall GII management, project implementation and suggestions to support members activities through collaborative systems

2007, April 06th, Friday @ 2:00pm
Presenter(s):
  • Mikami Sensei
Topics:
  • First Seminar & Meet Together

LOP SEMINAR MEMBERS

  1. Mikami Sensei
  2. Kodama Sensei
  3. Ashu Sensei
  4. Suzuki Sensei
  5. Chew Yew Choong [D1]
  6. Ho Viet Nga [M2-Sept]
  7. Rizza Caminero [M2-Sept]
  8. Ishihara Naoyuki [M2]
  9. Koda Keisuke [M2]
  10. Hoshino Tetsuya [M2]
  11. Matsui Masashi [M2]
  12. Umarjan Usman [M1]
  13. Pann Yu Mon [M1-Sept]
  14. Zin Maung Maung [M1-Sept]
  15. Mohd Zaidi [D3-Dec]

 

18:23:00 - zaidi

No comments

2007-06-05

Tree of Languages Relationship and the Language Similarity Table

Amazing image! The Tree of Languages was created from a 3-gram table using the neighbor-joining algorithm of Saitou and Nei.

SOURCES: Corpus building for minority languages by Kevin P. Scannell
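To make the idea of a "3-gram table" concrete, here is a minimal sketch of how a pairwise distance table between text samples could be built from character 3-gram counts. The use of cosine distance is an assumption, since the post does not say which measure was used, and the neighbor-joining step that turns such a table into a tree is not shown.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text):
    """Count overlapping character 3-grams in `text`."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_distance(p, q):
    """1 - cosine similarity between two 3-gram count profiles."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return 1.0 - dot / norm if norm else 1.0


def distance_table(samples):
    """Pairwise distances between named text samples."""
    profiles = {name: trigram_profile(text) for name, text in samples.items()}
    names = sorted(profiles)
    return {(a, b): cosine_distance(profiles[a], profiles[b])
            for a in names for b in names}
```

Samples that share many 3-grams get a distance near 0, and samples with no 3-gram in common get a distance of 1.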
12:45:53 - ycchew

No comments

2007-05-28

Hungary web crawling completed.

Crawl experiment: Hungary web crawling

CCTLD domain : .hu

Seed URLs :

http://iwiw.hu/
http://google.co.hu/
http://freemail.hu/
http://t-online.hu/
http://origo.hu/
http://index.hu/
http://gportal.hu/
http://extra.hu/
http://citromail.hu/
http://lap.hu/
http://goldengate.hu/
http://port.hu/
http://love.hu/
http://uw.hu/
http://freeweb.hu/
http://nemzetisport.hu/
http://nlcafe.hu/
http://startlapjatekok.hu/
http://vatera.hu/
http://wiw.hu/
http://atw.hu/
http://hotdog.hu/
http://tx.hu/
http://sztaki.hu/
http://teszvesz.hu/
http://teveclub.hu/
http://sg.hu/
http://videa.hu/
http://expressz.hu/
http://ok.hu/
http://hirkereso.hu/
http://rtlklub.hu/
http://vipmail.hu/
http://videoplayer.hu/
http://videobomb.hu/
http://honfoglalo.hu/
http://blog.hu/
http://ingyenfilmek.hu/
http://kapu.hu/
http://fw.hu/
http://freeblog.hu/
http://ebnevelde.hu/
http://blikk.hu/
http://tango.hu/
http://velvet.hu/
http://try.hu/
http://tar.hu/
http://pina.hu/
http://szon.hu/
http://kepfeltoltes.hu/
http://hirstart.hu/
http://totalcar.hu/
http://protect.hu/
http://gov.hu/
http://ppl.hu/
http://tvn.hu/
http://dyn.hu/
http://jamba.hu/
http://top70.hu/
http://partyphoto.hu/
http://t-mobile.hu/
http://fn.hu/
http://blogter.hu/
http://tmxd.hu/
http://animeaddicts.hu/
http://randivonal.hu/
http://im-net.hu/
http://travian.hu/
http://startlap.hu
http://www.boon.hu
http://www.bme.hu
http://www.lendulet.hu/
http://www.budapest.hu/
http://www.palya.hu/
http://www.music.hu/
http://www.hrportal.hu/
http://www.panorama.shp.hu/
http://programmagazin.hu/
http://chat.hu/
http://sulinet.hu/

System locks : 1-40

Max depth : 30

Max URLs per host : 40,000

URL delay : 10,000 ms

Crawler name : UbiCrawler/v0.4beta (http://gii.nagaokaut.ac.jp/~ubi/)

Contact e-mail : 0 5 5 9 1 9 @ m i s . n a g a o k a u t . a c . j p

For webmasters: to stop LOP's crawling:

UbiCrawler supports the Robot Exclusion Standard. If you want to exclude your site from being crawled by UbiCrawler, see The Web Robots Pages.

Briefly, you can put this robots.txt file at the root of the web server you want to exclude from the crawling.

To monitor network traffic : http://gii2.nagaokaut.ac.jp/~chew/php/phpViewRrdGraph.php?rrdgraph=netTraffic&duration=day&btnSubmit=Submit

The general status of the Hungary crawl can be viewed here.
15:34:51 - ycchew

No comments

2007-05-01

Data backup!

After two weeks of cleaning, sorting and moving, as of today, 2007-05-01, all existing data used for LOP research has been backed up to the EMC storage. Below is a summary of the data:

crawl experiments:
------------------
africa-051208-1556
africa-060606-2324
africa2-060113-1510
aosis-ot-060907-1638
asia-NOcjk-060705-1100
india-060628-1319
india-utf8-060126-1805
india-utf8-060127-0113
kh-la-060222-1645
lk-060328-1511
niigata-060224-1014
oceania-060306-1743
oceania-060307-1816
oic-041120-0129

crawl data size:
----------------
15G /emcpowera2/gii
187G /emcpowera2/gii-pc2
53G /emcpowera2/gii-pc3
144G /emcpowera2/gii-pc4
149G /emcpowera2/gii-pc5
188G /emcpowera2/gii-pc6
176G /emcpowera2/gii-pc7
218G /emcpowera2/gii-pc8
166G /emcpowera2/gii-pc9
127G /emcpowera1/gii-pc10
119G /emcpowera2/gii-pc11
100G /emcpowera1/gii-pc12
108G /emcpowera1/gii-pc13
124G /emcpowera1/gii-pc14
121G /emcpowera1/gii-pc15
118G /emcpowera1/gii-pc16
96G /emcpowera1/gii-pc17
87G /emcpowera1/gii-pc18
4.0K /emcpowera1/gii-pc19
109G /emcpowera1/gii-pc20
55G /emcpowera1/gii-pc21
60G /emcpowera1/gii-pc22
1.8G /emcpowera1/gii-pc23
62G /emcpowera1/gii-pc24
28G /emcpowera1/gii-pc25
62G /emcpowera1/gii-pc26
26G /emcpowera1/gii-pc27
58G /emcpowera1/gii-pc28
56G /emcpowera1/gii-pc29
64G /emcpowera1/gii-pc30
50G /emcpowera2/gii-pc31
64G /emcpowera2/gii-pc32
60G /emcpowera2/gii-pc33
52G /emcpowera2/gii-pc34
58G /emcpowera2/gii-pc35
59G /emcpowera2/gii-pc36
26G /emcpowera1/gii-pc37
60G /emcpowera1/gii-pc38
57G /emcpowera2/gii-pc39
56G /emcpowera1/gii-pc40
1.6T /emcpowera1/
1.8T /emcpowera2/

After this backup, the EMC storage is almost full. New storage will be installed soon. Anyone who wishes to use this data should kindly contact the crawler administrator.
12:14:50 - ycchew

No comments

2007-04-16

5 steps for web language survey

New LOP members often ask what is needed to carry out a language observatory survey. Basically, there are just 5 major steps:

1. Preparation:

1.1 Prepare at least 200 URLs (usually referred to as seed URLs) for the country domain you are interested in. If you are interested in a bigger scope, say a region, consider using at least 50~100 URLs for each country in that region.

1.2 Prepare a complete set of UDHR (Universal Declaration of Human Rights) documents in various Language+Script+Encoding combinations to be used as "Teacher" text to train the LI (language identifier).

2. Web crawling:

Once the seed URLs are ready, configure and start the web crawler. Monitor the crawler's status to decide whether you need to terminate a crawl manually.

3. Language identification:

After crawling, run the LI in batch mode against the newly downloaded web documents.

4. Result:

From the identifier output, consolidate and verify the results.

5. Publish report:

Finally, publish your result.
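To give new members a feel for steps 1.2 and 3, here is a toy sketch of training on "Teacher" texts and then identifying a document. It is not the actual LI: the byte 3-gram overlap scoring is an assumption chosen for brevity, and the names byte_trigrams, train and identify are hypothetical. Working on raw bytes means the encoding is part of what gets recognised, so each label can be a (language, script, encoding) tuple.

```python
from collections import Counter


def byte_trigrams(data):
    """Count overlapping 3-byte sequences in `data` (a bytes object)."""
    return Counter(data[i:i + 3] for i in range(len(data) - 2))


def train(teacher_texts):
    """Build one trigram profile per (language, script, encoding) label."""
    return {label: byte_trigrams(data) for label, data in teacher_texts.items()}


def identify(profiles, data):
    """Return the label whose teacher profile covers most of the document."""
    target = byte_trigrams(data)
    total = sum(target.values()) or 1

    def coverage(profile):
        # share of the document's trigrams that also occur in the teacher text
        return sum(n for gram, n in target.items() if gram in profile) / total

    return max(profiles, key=lambda label: coverage(profiles[label]))
```

In a batch run, identify would be called once per downloaded document and the returned labels tallied for the report.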
15:57:17 - ycchew

No comments