
- Crawling and Web Indexes
Only crawl pages you're allowed to; Respect robots.txt (more on this shortly) ... For a URL, create a file URL/robots.txt; This file specifies access restrictions. ...
www.cs.wright.edu
- netlab.szpku.edu.cn
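What is robots.txt? When a robot visits a web site, it first checks that site's robots.txt file. If robots.txt is found, the robot parses its contents to see whether the site allows it to fetch documents from that site.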
netlab.szpku.edu.cn
- [www.w3.org].ppt
Achievements of /robots.txt ... Simple = good; Add PleaseVisit to /robots.txt; Bind policy with resource; Locate more meta data with resource; Fix the Web ...
www.w3.org
- Crawling
Robots.txt example. No robot should visit ... Respect politeness/robots.txt; Avoid getting stuck in traps; Detect/minimize spam; Get the “best” pages. What's best?
medialab.di.unipi.it
- Web Crawling
A server can specify which parts of its document tree any crawler is or is not allowed to crawl by a file named 'robots.txt' placed in the HTTP root directory, e.g. ...
www.cs.kent.edu
- Information Retrieval
Caches the contents of robots.txt for various web sites as it crawls them. Designed to be extensible to other protocols; Had to write own HTTP requestor in Java ...
www.seas.upenn.edu
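
Several of the snippets above describe the same routine: before fetching a page, a crawler reads robots.txt from the site's root, checks whether the URL is permitted, and caches the parsed rules per site. A minimal sketch of that routine in Python follows, using the standard library's urllib.robotparser. The class name RobotsCache, the fetch_text callable, the sample rules, and the example.com URLs are assumptions made for illustration; none of them come from the listed sources.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body used for illustration only. A real crawler
# would download it from <scheme>://<host>/robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


class RobotsCache:
    """Keeps one parsed robots.txt per (scheme, host) pair.

    Illustrative helper, not taken from any of the listed sources.
    """

    def __init__(self, fetch_text):
        # fetch_text: callable mapping a robots.txt URL to its text content.
        self._fetch_text = fetch_text
        self._parsers = {}

    def allowed(self, user_agent, url):
        parts = urlsplit(url)
        key = (parts.scheme, parts.netloc)
        parser = self._parsers.get(key)
        if parser is None:
            # First visit to this host: fetch and parse its robots.txt once.
            robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
            parser = RobotFileParser()
            parser.parse(self._fetch_text(robots_url).splitlines())
            self._parsers[key] = parser
        return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Stand-in fetcher that returns the sample rules without network access.
    cache = RobotsCache(lambda robots_url: EXAMPLE_ROBOTS_TXT)
    print(cache.allowed("ExampleBot", "http://example.com/index.html"))
    print(cache.allowed("ExampleBot", "http://example.com/private/data.html"))
```

Passing the fetcher in as a parameter keeps the sketch runnable offline. A production crawler would also have to decide how long cached rules stay valid and what to do when robots.txt is missing or unreachable.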
4566 Document Search ©2010 www.4566.info