Geeks With Blogs
Path Notes of a Kodefu Master blog

At ConvergeSC 2009, an attendee asked me to describe how to prevent crawlers from trolling your ADO.NET Data Service. I explained as best as I could, but I felt like a blog post might make it more clear.

To a bot, your data service looks like any other site on the web. Sure, it’s reading either Atom, POX, JSON, or some other bizarre format you’ve concocted, but it’s still data coming down through http and discoverable via links. There are ways to prevent a bot from crawling, but information available on the web that doesn’t require authentication can be crawled.

The first way to prevent your service from being crawled by a legitimate bot is to put a Robots.txt in the root of the site. Inside the file, put the following lines:

User-agent: *
Disallow: /

This locks down the entire site from being crawled by the bot. If your service coexists with a site you want to be crawled, you can change the Disallow option to /MyService.svc/. be sure to include the closing slash so other pages aren’t accidentally matched.

The conference attendee seemed to be concerned specifically about anchor tags and AJAX. If you’re using the OnClick event of the anchor tag, most spiders will not follow it. However, if the uri is in an href, a crawler will pick it up. Bing and Google will honor a rel attribute with the “nofollow” value to prevent indexing the page. However, Yahoo and Ask will still follow and index a link with that attribute.

If your service is publicly available, using the robots.txt is the way to go. If it’s not publicly available, the service should already be locked down through authentication techniques.

Note: Cross posted from KodefuGuru.
Posted on Monday, June 29, 2009 4:20 PM | Back to top

Comments on this post: Prevent Crawlers On Your Data Service

No comments posted yet.
Your comment:
 (will show your gravatar)

Copyright © Chris Eargle | Powered by: