Language Observatory

2009-01-01

Asia-NoCJK web crawling

Crawl experiment: Asia-NoCJK web crawling

ccTLD domains

ae
af
az
bd
bh
bn
bt
cy
id
il
in
iq
ir
jo
kg
kh
kw
kz
la
lb
lk
mm
mn
mv
my
np
om
pg
ph
pk
ps
qa
sa
sg
sy
th
tj
tm
tp
tr
uz
vn
ye
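
As a rough illustration of how a crawl can be restricted to the ccTLDs listed above, the Python sketch below checks whether a URL's host name ends in one of them. The names `ASIA_NOCJK_CCTLDS` and `in_crawl_scope` are hypothetical and are not taken from UbiCrawler; how the real crawler selects hosts is not described here.

```python
from urllib.parse import urlsplit

# ccTLDs targeted by the Asia-NoCJK crawl, copied from the list above.
ASIA_NOCJK_CCTLDS = frozenset("""
ae af az bd bh bn bt cy id il in iq ir jo kg kh kw kz la lb lk mm mn mv my
np om pg ph pk ps qa sa sg sy th tj tm tp tr uz vn ye
""".split())


def in_crawl_scope(url: str) -> bool:
    """Return True if the URL's host name ends in one of the target ccTLDs.

    Hypothetical helper for illustration only.
    """
    # urlsplit lower-cases the host name; it is None when the URL has no host.
    host = (urlsplit(url).hostname or "").rstrip(".")
    if "." not in host:
        return False
    # The ccTLD is the last dot-separated label of the host name.
    return host.rsplit(".", 1)[1] in ASIA_NOCJK_CCTLDS
```

For example, `in_crawl_scope("http://www.example.co.th/")` is true, while a host under `.jp` falls outside the crawl.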


System locks : 22-40

Max depth : 8

Max URLs per host : 1,000

URL delay : 10,000 ms
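
The limits above (maximum depth, maximum URLs per host, URL delay) can be pictured with a small Python sketch. It is not UbiCrawler's implementation: the class and method names are hypothetical, the delay is assumed to apply between successive fetches from the same host, and depth is assumed to start at 0 for a seed URL.

```python
from collections import Counter
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class CrawlLimits:
    """Values taken from the crawl settings listed above."""
    max_depth: int = 8
    max_urls_per_host: int = 1000
    url_delay_ms: int = 10_000


@dataclass
class Frontier:
    """Hypothetical bookkeeping that enforces the limits per host."""
    limits: CrawlLimits = field(default_factory=CrawlLimits)
    fetched_per_host: Counter = field(default_factory=Counter)
    next_allowed_ms: dict = field(default_factory=dict)

    def may_fetch(self, url: str, depth: int, now_ms: int) -> bool:
        host = urlsplit(url).hostname or ""
        if depth > self.limits.max_depth:
            return False  # link is deeper than the allowed depth
        if self.fetched_per_host[host] >= self.limits.max_urls_per_host:
            return False  # per-host URL budget is used up
        # Assumption: the delay is enforced per host.
        return now_ms >= self.next_allowed_ms.get(host, 0)

    def record_fetch(self, url: str, now_ms: int) -> None:
        host = urlsplit(url).hostname or ""
        self.fetched_per_host[host] += 1
        self.next_allowed_ms[host] = now_ms + self.limits.url_delay_ms
```

Under these assumptions, a host is contacted at most once every 10 seconds and contributes at most 1,000 URLs to the crawl.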

Crawler name : UbiCrawler/v0.4beta (http://gii.nagaokaut.ac.jp/~ubi/)

Contact e-mail : s077003@ics . n a g a o k a u t . a c . j p

For webmasters: to stop LOP's crawling

UbiCrawler supports the Robots Exclusion Standard. If you want to exclude your site from being crawled by UbiCrawler, see The Web Robots Pages.

In brief, place this robots.txt file at the root of the web server you want to exclude from crawling.
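
To give an idea of what such a file could look like and how a compliant crawler would read it, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The user-agent token `UbiCrawler` is an assumption based on the crawler name given above, and `is_allowed` is a hypothetical helper; the exact file the authors link to is not reproduced here.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content that blocks UbiCrawler from the whole site
# while leaving other crawlers unaffected.
ROBOTS_TXT = """\
User-agent: UbiCrawler
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this file served at `/robots.txt`, `is_allowed(ROBOTS_TXT, "UbiCrawler/v0.4beta", "http://www.example.th/page.html")` evaluates to false, while a crawler with a different name is still allowed.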

To monitor network traffic : http://gii2.nagaokaut.ac.jp/~ycchew/php/phpViewRrdGraph.php?rrdgraph=netTraffic&duration=day&btnSubmit=Submit

The general status of the Asia-NoCJK crawl can be viewed here.

11:45:37 - ycchew