chrometweaks.org

Can anyone recommend a good, UK based and reasonably priced iPage web host?


First off, can anyone recommend a good, UK-based and reasonably priced iPage web host? Thanks for any response. Another question I've got...

Folks - can anyone tell me the current state of "SID killing" to prevent search engine spiders like Googlebot from getting trapped in your site?

We have an osCommerce iPage site implemented about a year ago, and to this day, over 90% of the iPage site traffic is spiders. We applied a contributed fix late last year, but it didn't solve the problem.

Thanks in advance...

Comments (128)

Yup, but... you might wanna make sure and wait for another commenter to confirm this as I am unsure of myself. Better yet, why don't you e-mail the iPage guys because they can answer you better...

Comment #1

Chris - thanks for chiming in.

I was told Ian Wilson's fix was applied, but I just searched through the site code for "kill_sid" and there are no references. So... I'm thinking now no patches were applied.

That said, is Ian's fix generally thought to work? I saw another post somewhere about a "SID killer - not Ian's" but it looked like that fix required a huge list of search engine domains, which I'm not thrilled about.

And yes, we are getting SIDs attached to the URLs, which I assume is what causes the spiders to keep believing they hit a new URL and get "trapped".
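
For anyone following along, here is a rough Python sketch (not osCommerce code) of why a session ID in the URL makes every link look new to a crawler. The helper name strip_sid and the two URLs are made up for illustration; the only thing taken from this thread is the idea of an osCsid query parameter.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_sid(url, sid_param="osCsid"):
    """Return the URL with the session-ID query parameter removed.

    strip_sid is a hypothetical helper, not part of osCommerce.
    """
    parts = urlsplit(url)
    # Keep every query parameter except the session ID.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != sid_param]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Two illustrative URLs for the same product, differing only in session ID.
url_a = "/catalog/product_info.php?products_id=945&osCsid=aaaa1111"
url_b = "/catalog/product_info.php?products_id=945&osCsid=bbbb2222"

# As raw strings the URLs differ, so a crawler treats them as two pages.
print(url_a == url_b)
# With the session ID stripped they are the same page.
print(strip_sid(url_a) == strip_sid(url_b))
```

Every new session hands the spider a fresh set of "different" URLs for the same pages, which is the trap being described.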

Comment #2

No, Ian was never able to get his SID killer to work properly. The generally accepted method is to maintain a list of spider 'user agents', and not assign the SID if a match is found. In fact, this method was actually adopted into the core code sometime in the last year.

Yes, your assumptions are correct about how they are getting trapped. This may also lead to invalid search engine listings, because if people follow them and the session has expired, they might only see an error on your website.

Comment #3

Thanks Chris.

What's the easiest way to get and incorporate the relevant code?

Is it the array and loop in /includes/functions/html_output.php, referenced by you in a post in Nov 2003?

If so, I'm guessing I should download the current codebase and use html_output.php from that, to ensure I have an up-to-date list of user agents.

But let me know what you suggest.

Comment #4

Not sure what thread you mean, but yeah, I posted about it at least a dozen times when we first discovered the problem.

I wouldn't worry about keeping the list up to date. There are really only about 5 or 6 bots that have the problem: Googlebot, inktomi, fastbot, msn and a couple of others. The code that I offered matches partial strings too, so you should be able to get away with little maintenance.

Not sure where they are storing the user agents in the new codebase. It was in an array, then a separate file, and there was talk of putting them in a table. Dunno what became of that.
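
The user-agent approach described in this thread can be sketched roughly as follows. This is Python for illustration only, not the PHP that ships with osCommerce; the substring list, the function names is_spider and build_link, and the osCsid parameter name are assumptions based on what is mentioned here.

```python
# Illustrative substrings only; this is NOT the list shipped with osCommerce.
SPIDER_SUBSTRINGS = ("googlebot", "inktomi", "fastbot", "msnbot")


def is_spider(user_agent):
    """Return True if any known spider substring appears in the user agent."""
    ua = (user_agent or "").lower()
    # Partial, case-insensitive matching keeps the list short:
    # "googlebot" matches "Googlebot/2.1" and any later version.
    return any(fragment in ua for fragment in SPIDER_SUBSTRINGS)


def build_link(path, session_id, user_agent, sid_param="osCsid"):
    """Append the session ID to a link unless the visitor looks like a spider."""
    if is_spider(user_agent):
        return path
    separator = "&" if "?" in path else "?"
    return f"{path}{separator}{sid_param}={session_id}"


print(build_link("/catalog/index.php", "abc123", "Googlebot/2.1"))
print(build_link("/catalog/index.php", "abc123", "Mozilla/4.0 (compatible)"))
```

Because matching is on partial strings, a new bot version number does not require touching the list, which is why little maintenance is needed.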

Comment #5

Chris - thanks once again!

Good info on the fact that it's only a few bots that cause problems, and your list matches the source of our traffic.

Just one more question, if you don't mind... (though it's rather longwinded)

In your first reply to this thread, you asked "When you see spiders crawling your site, is an SID attached to the URL?"

What were you getting at here? The reason I ask is I have since discovered that some of the big offenders DO NOT result in SIDs attached to the URLs (which I assumed would always be the case).

Example: Googlebot/2.1 accounts for over 75% of our traffic (the IPs beginning with 64.68 below):

 Reqs:  %reqs: pages: %pages: Gbytes: %bytes: host
-----: ------: -----: ------: ------: ------: ----
92260: 14.70%: 92203: 16.17%:  2.163: 15.30%: 66.163.170.172
87735: 13.98%: 87400: 15.33%:  2.118: 14.99%: 64.68.87.43
78570: 12.52%: 78272: 13.72%:  1.897: 13.42%: 64.68.86.9
73467: 11.70%: 73153: 12.83%:  1.774: 12.55%: 64.68.86.59
72782: 11.59%: 72484: 12.71%:  1.757: 12.43%: 64.68.86.79
69019: 10.99%: 68732: 12.05%:  1.666: 11.79%: 64.68.87.63
66653: 10.62%: 66363: 11.64%:  1.608: 11.38%: 64.68.86.54

I started looking into this once I ran the "Search Engine Simulator" you mentioned in an old thread and found the output looking fairly normal, with no SIDs in the URLs.

Other bots *do* result in attached SIDs (like msn), and fit the problem as I originally understood it. Here's an example from my access log:

65.54.188.82 - - [29/Aug/2004:00:29:26 -0500] "GET /catalog/product_info.php?products_id=945&action=notify&osCsid=622f7cd701394041840ec954d7436551 HTTP/1.0" 302 0 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"

65.54.188.82 - - [29/Aug/2004:00:30:26 -0500] "GET /catalog/product_info.php?products_id=646&action=notify&osCsid=046c57d2d10b46ed780dfe570291514d HTTP/1.0" 302 0 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"

But none of these cause any significant traffic (the other big offender in the list above is YahooSeeker/1.2).

So, now it appears over 90% of our traffic is spiders, and the request URLs do not have SIDs attached.

Can you explain this?
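
The check described above (which user agents are hitting the site, and whether their request URLs carry a session ID) could be scripted along these lines. This is only a sketch in Python, assuming the access log is in the usual combined format as in the msnbot lines quoted in this post; the function name summarize is made up, and the third sample line is a hypothetical Googlebot request added for contrast, not a real log entry.

```python
import re
from collections import Counter

# Combined log format: host, ident, user, [time], "request", status, size,
# "referer", "user agent".
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def summarize(lines, sid_marker="osCsid="):
    """Count requests per user agent, and how many of them carry a session ID."""
    totals = Counter()
    with_sid = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that are not in combined format
        agent = match.group("agent")
        totals[agent] += 1
        if sid_marker in match.group("url"):
            with_sid[agent] += 1
    return totals, with_sid


SAMPLE = [
    '65.54.188.82 - - [29/Aug/2004:00:29:26 -0500] "GET /catalog/product_info.php?products_id=945&action=notify&osCsid=622f7cd701394041840ec954d7436551 HTTP/1.0" 302 0 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"',
    '65.54.188.82 - - [29/Aug/2004:00:30:26 -0500] "GET /catalog/product_info.php?products_id=646&action=notify&osCsid=046c57d2d10b46ed780dfe570291514d HTTP/1.0" 302 0 "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"',
    # Hypothetical line for contrast: a spider request with no session ID.
    '64.68.87.43 - - [29/Aug/2004:00:31:00 -0500] "GET /catalog/index.php HTTP/1.0" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]

totals, with_sid = summarize(SAMPLE)
for agent, count in totals.most_common():
    print(f"{count:6d} requests, {with_sid[agent]:6d} with SID  {agent}")
```

Run against a full log, a high request count with zero SIDs for an agent would match what the stats above suggest for Googlebot.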

Comment #6

Yeah, that Yahoo spider is a pain.

But if the request URL does not have a SID attached, that means that you must have a SID killer in place, or that you force cookie use. Either way, they aren't causing you any more trouble than they usually do.

Just mark the bandwidth they use up to 'cost of having a website'. If you want the traffic they bring, then you have to allow them to suck up some of your bandwidth crawling your site.

Sorry I don't have better news.

Comment #7

To work around this problem, I have a 404.php that returns a search page - it's a contrib somewhere around here, if I remember right.

-jared

This post has been edited by Jcall: 08 September 2004, 21:02

Comment #8

Guys - I'm willing to accept Chris's explanation that a certain amount of traffic from search engines is to be expected. Still, 14GB per month, over 90% of the total traffic, seems excessive for an iPage site with a modest number of products, customer reviewing disabled, etc.

Is there any way to get stats on expected or normal traffic from search engines for a "typical" osCommerce site?

Comment #9

No, you're right, 14GB is WAY too much bandwidth to be consumed by bots. In total, you should see less than 1GB consumed by bots. Here's an excerpt from my awstats, showing the bots and how much bandwidth each one took:

Inktomi Slurp                                       33.26 MB
Googlebot (Google)                                  97.91 MB
Unknown robot (identified by 'crawl')                9.34 MB
Alexa (IA Archiver)                                 82.85 MB
Unknown robot (identified by hit on 'robots.txt')  116.88 KB
Jeeves                                               4.06 MB
Unknown robot (identified by 'spider')               1.62 MB
WISENutbot (Looksmart)                               3.52 MB
Unknown robot (identified by 'robot')              294.89 KB
LinkWalker                                           1.13 MB
Scooter (AltaVista)                                 89.95 KB
Voila                                               17.07 KB
Walhello appie                                      45.31 KB

The important question here would be: has your iPage domain ever had an OSC store on it that did not have the SID killer in place?
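
To put the awstats excerpt above next to the "less than 1GB" rule of thumb, the figures can be added up with a few lines of Python. This is just arithmetic on the numbers quoted in this post; the conversion of 1 MB = 1024 KB is an assumption about how the stats were reported, and the names excerpt and to_mb are made up.

```python
# Bandwidth per bot, copied from the awstats excerpt above.
excerpt = {
    "Inktomi Slurp": "33.26 MB",
    "Googlebot (Google)": "97.91 MB",
    "Unknown robot (identified by 'crawl')": "9.34 MB",
    "Alexa (IA Archiver)": "82.85 MB",
    "Unknown robot (identified by hit on 'robots.txt')": "116.88 KB",
    "Jeeves": "4.06 MB",
    "Unknown robot (identified by 'spider')": "1.62 MB",
    "WISENutbot (Looksmart)": "3.52 MB",
    "Unknown robot (identified by 'robot')": "294.89 KB",
    "LinkWalker": "1.13 MB",
    "Scooter (AltaVista)": "89.95 KB",
    "Voila": "17.07 KB",
    "Walhello appie": "45.31 KB",
}

# Assumption: binary units (1 MB = 1024 KB, 1 GB = 1024 MB).
UNIT_TO_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}


def to_mb(text):
    """Convert a string such as '116.88 KB' to megabytes."""
    value, unit = text.split()
    return float(value) * UNIT_TO_MB[unit]


total_mb = sum(to_mb(size) for size in excerpt.values())
print(f"Total bot bandwidth: {total_mb:.2f} MB")
print(f"Under 1 GB: {total_mb < 1024}")
```

The total comes to roughly 234 MB for all listed bots combined, well under the 1GB mark.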

Comment #10

Hi, I am a total newbie and was reading this interesting post.

What do you use to monitor traffic on the network to identify these spider bots?

Comment #11

As I mentioned in my previous post, I used 'awstats' to capture those statistics. It came with the cPanel for my account on our server.

Comment #12

What contribution did you apply? When you see spiders crawling your site, is an SID attached to the URL?

This post has been edited by Wizardsandwars: 01 September 2004, 15:59

Comment #14


This question was taken from a support group/message board and re-posted here so others can learn from it.