chrometweaks.org

Can't find my domain on iPage.com?



Quick question: Can't find my domain on iPage.com? Thanks for any comment. The 2nd question I've got is: I've had a Google bot crawling my site for the last 12 hours. I was finally indexed earlier this week (after a two-month wait), however my page rank sux (3/10).

I noticed with the IP Tracking contrib that the majority of URLs that Google's bot is crawling have the osC session id in them. I was wondering if this is a bad thing. Anyone know?

Comments (76)

I would like to know the answer too. Does anyone here know the answer to that question? I'll do some Googling and get back to you if I discover a good answer. You should email the people at iPage, as they could probably give you an answer.

Comment #1

Google is STILL all over my site. I've got something like 6000 hits from them in a 24 hour period.

I had to wait about 5 weeks before they crawled my site the first time. Then it took about 3 or 4 more weeks to get me indexed. Then a week after I'm indexed, it seems that they can't leave me alone.

That bot must've found something it liked, however it seems to be in a loop. It's hit the same pages over and over and over again with different session ids. I wonder what that means...?

Comment #2

Not sure how you have your meta tags set up... but I just had the Googlebot crawl my site this weekend as well. It didn't look to me like it looped itself (I'm not using 'friendly URLs' through osC). But I did see about 400+ hits from their bots to my site, which is about right for the number of products/etc. that I've listed. So, as far as their bot crawling goes, it seems to work ok for me.

I know a 'gigabot' also hit me a few days ago... and I know for a fact that it was looping a lot until it finally backed off before the weekend.

As far as osC goes, I'm using a CVS snapshot from 091702.

Comment #3

Same here... mine is getting hit right now. I had 40,000+ hits when I left for work; I looked at it just a second ago and it was showing 364 customers online (I wish), and I'm up to 65,302 hits as of now.

And I'm looking for the answer right now about the session ids.

Comment #4

No wonder your site gets hits - look at all of the women in skimpy outfits. I don't want to get into details. Lovely site...

Comment #5

Can anyone point me to the thread about the spider-trapping PHP scripts and how to use them?

Thanks.

Comment #6

75,159 and they're still on it... not in the same numbers as around 2am this morning, but man... something is messed up, they should have left by now.

Hopefully they are just being picky and gathering every aspect of my site for that Number 1 ranking.. lol... I'll keep dreaming.

Comment #7

I just got hit by Google as well.

It's actually indexed a couple of the osC pages this time. The main page was indexed fine with no problems. The other page, I think it was one from the information box, shows in Google about 500 times - with different session ids. Strangely, no other pages show.

I've got the allprods mod; I thought I had updated it so Google wouldn't get the session ids... but it still is. Back to the drawing board.

One thing I have done that has helped is to add a robots.txt file to stop it touching the files / directories that it doesn't need to. This has cut down on the amount of hits it generates.

Jon.

Comment #8

Would you happen to be using that 'meta tag' contribution that's in the Downloads section?

Perhaps the guys who are getting the bots looping in and out are all using that... (just wondering if it's a coincidence).

Comment #9

Hi.

My site is rated 4/10.

My site has been listed in Google for 4 months, DMOZ for 5 months. Every 3 weeks the bot was crawling my index page, and that only. It kept this behaviour for the whole 4 months.

Some days ago I edited html_output.php (as stated by someone in a post) to remove the session id whenever the user agent is a bot, like Googlebot. From almost that day until today, Googlebot has been crawling ALL my pages (first the "static" ones, then the ones with the '?' character in them) in a particular cycle (+/- 2-3 pages per hour per IP). For the first days, the results in Google weren't reflecting the bot logs. 3-4 days past the bot visit, every search in Google reflected the pages that the bot crawled (does that verb exist? :?:). For some days now it's been crawling every product_info.php.

I hope to get them listed tomorrow or later. The bot doesn't get any session ID in the URL; I don't use search-engine-friendly URLs or the META contribution.

Just FYI.

Wishes of good sales. :wink:

Comment #10

I have also been hit all day by Google, or at least I hope so - the IPs start with 216.239.46.

From what I can find out here, that appears to be their IP range. The links have an osCsid and a long number after them. Is this the session id? If it is, how do I stop that, as it appears from the comments that it is not a good thing.

I am using allprods in the footer and have not used the meta tag add-on yet.

I am also experiencing problems with customers signing up for an account. They say that after they fill in the account information and click continue, the page just reappears and they cannot continue. I have created 10 accounts with no problems at all and cannot figure out why customers are having this problem. Anyone else having the same problem?

Comment #11

I am using both the Meta-tags contrib and the search engine friendly URLs, and I still get the osC session id in the URLs that Google is crawling.

And I also was experiencing some performance issues creating an account while Google was spidering my site.

Google has finally left my site, but they left me with 10,000 more hits than I had before. WOW. I had no idea the spider would be so intense.

Comment #12

I am a newbie here - why is it bad to have session ids in the URLs they are scanning, and how do I stop it?

I have got friendly URLs turned on and I am using allprods ver 1.5.

Glad you also (as in not just me) had problems with creating a new account. I could not reproduce this all weekend and was able to create as many accounts as I liked. I am in Ireland and my server is in Canada, and the customers with the problems were in Canada and the USA.

Strange!!! But life is always strange... :-)

Comment #13

To tell you the truth, I'm not sure why it's not good to have the session ID, other than the fact that Google doesn't like them.

The problem I was having while Google was spidering me was that it took a long time to create an account. I attribute this to the fact that Google was hammering either my server or my database.

Comment #14

I was hammered by Googlebot (Google) in the last few days, which drove my shop to a high load. I have about 900 products. I figured out that Googlebot tries to crawl every link and every page, and was creating about 400 visits at a time (according to Who's Online in the admin panel). Considering the session id problem, I was forced to remove most session ids (from the tep_href_link function in html_output.php) and also define my robots.txt file to tell it not to index or follow most of the pages. I even defined some robots meta tags to do this.

Btw, Google seems not to like PHP files (because I don't see my .php files listed). I am going to change the PHP file extension to other names, e.g. .htm or .asp, etc.

Regards.
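[For reference, a robots meta tag of the kind described above normally looks like this when placed in a page's <head> - a minimal example only; which pages get it is up to you:

<meta name="robots" content="noindex,nofollow">

Bots that honour it will neither index the page nor follow its links.]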

Comment #15

Can someone post the code changes to html_output.php that are needed to stop the session id being added when it is a robot crawling the site? Or, if possible, email the whole file to me at themanager@groovecity.co.uk. I tried making the changes before but they didn't work; I'm guessing I got something wrong - hence why the whole file would be useful.

It's in there at the moment - for some reason it seems to hit the site for about 3 days and then goes away for a month.

Thanks,

Jon.

Comment #16

Here is the post for changing the html_output.php file to use no SIDs:

http://forums.oscommerce.com/viewtopic.php...p?p=68158#68158

Comment #17

By the way, if the spiders are on your site currently and you add the changes to your html_output and upload it, it may take a while for the bots to refresh the SIDs they are using and reflect the core URLs of your site.

Slowly but surely you can watch from your whos_online, and you'll see them start to lose the session ids.

They are finally getting what they came for and getting out of the loop on my site.

Comment #18

Great - thanks. But it would be good to get a view from the core team on this "amazingly simple" 8) mod - the problem of SIDs has been on the 'to-do' list for some time now... If this works, then it should be part of CVS.

Comment #19

I have to say that I am getting the same. Either we are missing something, or the pages that do get indexed do not hang around on who's online.

I suspect that we are missing something and that the session id is added somewhere else instead.

I also do not see any indexing of my allprods page.

Comment #20

Yeah, I'm disappointed that the damn bots are still all over my site, and have the session ids in the majority of links being indexed as well.

You're right, there must be another place that is adding the session id. It is indexing my allprods page though, and I have 3 or 4 products that it is indexing now with no session ids, but no progress in 8 or 10 hours.

We may have to look at temporarily changing our robots.txt file just to get rid of the thing for now.

Comment #21

Could the other place that they are being set be classes/sessions.php? I think it might be, but it is way too complex for me to start hacking.

Comment #22

I've made the changes to html_output.php and I'm still getting hit by the Google bot using session ids. However, I just noticed something interesting.

I used to have search engine safe URLs turned on; I turned it off yesterday. What's interesting is that all the Google URLs are search engine safe - every single one of them.

This suggests that it is just looking at URLs from a list that was built yesterday, or even earlier. If it had followed any links, they would appear as normal.

So... if the same is happening to anyone else, I would give it a bit longer before working on the problem more. It's done about 60,000 hits in three days and I only have 40 products in the catalogue. For some reason it hasn't hit my other shop, which has 2,500 products (we'd probably be talking millions if it had).

Jon.

Comment #23

For what it is worth, this hack achieves exactly the same level of success as the original hack to replace the sid with NULL. This means that the hack is correct, but just in the wrong place.

Comment #24

I just checked, and the only other reference is in a couple of sessions.php files, but these are purely there to set the session id - they don't put it in the URL. The only place I can see that happening is in html_output.php.

I've tried fixing it before, but backed the changes out as it didn't seem to do anything. I think it really is just a case of giving it a bit of time to sort itself out.

Jon.

Comment #25

I would guess that the hack needs to wrap around:

function tep_session_start() {
  return session_start();
}

in classes/sessions.php, as session_start() is a PHP function, not osC - i.e. don't start a session if it is a bot.

But I really do not know enough about this?

Comment #26

Forget the above - I just commented out the PHP function and loaded it to a test store, and still it creates sessions!

Comment #27

Aggggghhh! Stupid B**** - session_start() is in functions/sessions.php - I uploaded the wrong file :oops:

Anyway, removing session_start() certainly stops the sessions (and any chance for a customer to purchase), so we just need to wrap the hack around it somehow??
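[A rough sketch of the wrap being asked about here, in functions/sessions.php - an untested assumption, not a recommendation; as noted above, a missing session kills purchasing, so the check must only ever match bots, and the thread below ends up solving the problem elsewhere:

function tep_session_start() {
  // sketch: skip starting a session when the user agent looks like Googlebot
  if (eregi("googlebot", getenv("HTTP_USER_AGENT"))) {
    return false;
  }
  return session_start();
}
]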

Comment #28

Still the SIDs don't go - so another thought or two:

1) I don't see SIDs for normal customers, so why Google? The answer could be that Google is sending the SIDs itself - maybe from an earlier session, who knows.

2) It is reported in these forums that Google gets trapped in the site - regardless of the SIDs, this seems to be a key problem.

So maybe we can't actually solve this issue?

Comment #29

Session IDs are only added to the URLs when a customer does not have cookies enabled. Obviously, the Google spider mimics a customer that does not have cookies enabled.

I'm thinking that you are correct in that the spider is working off of a previously programmatically generated list. I've noticed that the percentage of URLs Google is spidering that do not have session ids is growing, however slowly.

If I don't notice a big difference by late this afternoon, I'm going to telnet into my server and manually kill all of Google's processes.

Comment #30

I'm with you on this - I just turned off Safe URLs and still the list grows - it's like an Alien attack!

So it must be that Google is replaying last month's 30,000 hits. It is logical when you think about it. So maybe your hack works after all?

But it will take a couple of months to find out!

On a brighter note, it means that it will be a long time before Google runs out of links to my site to test...

Comment #31

I didn't have this many hits from last month - I only had a few. I think that the spider is cycling through a list of URLs that it made earlier this week.

Comment #32

The other day when I checked, Google had the main page indexed correctly, and 1 other page indexed many hundreds of times, each with different URLs. If it was using these as a base, it would explain all the hits I am currently getting. The duplicated URL has now gone from the Google index, due, I imagine, to it currently re-indexing the site.

This could mean that Google needs to index the site twice before it clears up the session id mess. I'm still not seeing any clean URLs from Google though.

I tested the changes before and they seemed to work ok for me:

- Turn cookies off, go to your site, say no to cookies, and you should see the session id in the URL.

Then:

- Update the changed code so it will pick up your browser as a spider (I think you add your agent string, MSIE or something similar, to the detection code - you can see what you need from your website logs), turn cookies off, go to your site, say no to cookies, and you shouldn't see the session id in the URL.

This proved the changes do what they are supposed to. Though you do then need to ensure that all the spiders are picked up by the code. I'm going to go through our web logs to see if any are missing when I get a chance.

Jon.
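[A quick standalone way to sanity-check the detection pattern itself, without fiddling with browser cookies - a sketch only; the agent string below is just an example of what Googlebot sends:

<?php
// simulate the spider's user agent and run the same test used in html_output.php
$agent = 'Googlebot/2.1 (+http://www.googlebot.com/bot.html)';
if (eregi("googlebot", $agent)) {
  echo "spider detected - session id would be dropped\n";
} else {
  echo "not a spider - session id kept\n";
}
?>
]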

Comment #33

Jon, that confirms what I had already tested, so I guess the hack will work, if you have the IP address or name of the robot in the code. I think we have most of them covered.

The problem is, I can't get rid of Google NOW. Apparently, it's working off a list it built a few days ago. I telnet'ed into my webhost's server, and Google's process does not show up with a "ps -ef", even when logged in as root. So we can forget trying to kill the process.

The alternatives I can think of now to get rid of Google are the robots.txt file and the htaccess file. I'd hate to resort to the htaccess file, but I may have to - I'm losing business due to poor performance.

Comment #34

I am wondering if it is possible to use wildcards in the IP addresses... I noticed that Googlebot's IPs were all starting with 216.

It looks like they finally got enough of the URLs to index and have left my site for more than a few hours now... I have my fingers crossed for a good index and that they don't return for a few more weeks.

I'm going to try the wildcards in the code tonight or tomorrow afternoon and see if it works. If it does, I'll let you guys know.

Comment #35

I'm under the assumption that you don't need a wildcard. If you take a close look at the IP addresses in the list, many of them only have 3 numbers with a dot (.) at the end. I'm sure that you could do the same thing with 2 numbers with a dot at the end.

Also, this code looks at the name of the bot as well, so anything named 'oogle' will be caught too.
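[That works because strstr() simply looks for the fragment anywhere in the address, so a partial IP with a trailing dot behaves like a network-prefix match in practice - a small illustration, not part of the mod itself:

$host_ip = '216.239.46.123';  // example address in Googlebot's range
if (strstr($host_ip, '216.239.46.')) {
  // matches: the fragment '216.239.46.' is found in the address
  $is_spider = 1;
}
]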

Comment #36

You CAN use wildcards in IP addresses, but it is NOT advisable... you'll be blocking a lot more sites than just Google!

Especially if you just go for the 216.* approach... DON'T DO THAT!

BTW: in another post it was mentioned that the core dev team is looking into this issue. Jan advised using robots.txt for now.

Comment #37

The code below has successfully worked to remove the session ID on this last Googlebot crawl. It does not rely on having the correct IPs for the Googlebot - there are just too many to keep track of. The other code with all the IPs was originally written to redirect the bots to allprods.php.

The code below is from an old post; I did not write it. The session removal code goes after the search-engine-friendly URL code and before the code that appends the session to the link. I've included the code as a reference.

I also have a pretty substantial robots.txt file.

Ryan.

Comment #38

A little cleaning-up of this code results in:

// start session ID removal
if (ereg("ooglebot", getenv("HTTP_USER_AGENT"))) {
  $sess = NULL;
}
if (eregi("webcrawler", getenv("HTTP_USER_AGENT")) || eregi("internetseer", getenv("HTTP_USER_AGENT"))) {
  $sess = NULL;
}
// end session ID removal

Could this be tested? If it works, it could be the base for a solution. But we need more acknowledgements that it actually works.

Comment #39

As it is, it crashes my installation - I guess this was written for an earlier version of osC, or I am putting it in the wrong place?

Comment #40

What version of html_output.php are you using? You will find the version number at the top of the file.

Comment #41

Ok, open up the file and paste the following code:

if (eregi("googlebot", getenv("HTTP_USER_AGENT")) || eregi("internetseer", getenv("HTTP_USER_AGENT")) || eregi("WebCrawler", getenv("HTTP_USER_AGENT"))) {
  $sid = NULL;
}

after line 47, but before:

if (isset($sid)) {
  $link .= $separator . $sid;
}

Let me know if this works.

Comment #42

With a normal browser session:

- SID is assigned
- SID is added to the URL strings

With a 'special' browser session set up to mimic Googlebot:

- SID is assigned
- SID is NOT added to the URL strings

HTH.

Comment #43

Mark,

No crash - so it looks promising!

Google is all over my site now - but as we have reported before, I think that it will take a while before we can see the impact, as it is probably replaying URLs, including the SIDs, that it picked up before.

Comment #44

Well, my site has now gone down!! I have tried all the code posted up to yesterday here. I have turned off friendly URLs, as someone else said it had made a difference. Now I cannot see my site in a browser.

Can you post the code for html_output completely, with this new version without the listed IPs, for PHP gobshites like me? :cry:

My site was running last night, so I would like to upload this new code, as my admin panel shows that the bots are still clocking up hits. I guess there is just too much activity to allow me to visit my own site.

Thanks guys. The only thing that stops people like me throwing in the towel is knowing I am not the only poor sod suffering this way.

If there was a knighthood for free help, you guys would clean up at the awards.

Comment #45

Ryan,

It looks to me like this new code that you have posted does exactly the same thing as what I posted originally. Actually, I would need to add "ooglebot" to the $spider_footprint, but otherwise it looks like it has the same effect.

The IP is an additional test, to make sure we trap the bastard. Of course, with the masking in there, it is possible that it accidentally gives us a false positive.

Comment #46

I have the same redirect spider code implemented to redirect Google to my allprods.php file, but not for removing the session.

For some reason the redirect spider code did NOT redirect Google to allprods.php.

I know that the code that I posted works for removing the session on my site, but I would like to see if anyone else has success with it.

I'm sure there is some middle ground we can find between the two scripts to reliably catch Google for everyone.

I haven't seen the huge number of hits that you guys are seeing, though. From 10/3 - 10/5 Google only visited me 48 times. Granted, we are a new site.

Ryan.

Comment #47

Ryan,

Google has been going at my site for 3 days now, with up to 300 links at a time, day and night. And no sign of a let-up. :evil:

At least it has proved that osC can EASILY handle 300 or 400 simultaneous users without any problem. So thank you, Google, for doing my soak test for me.

I am certain that what it is doing is replaying the links, including SIDs, that it found last month - 30,000 hits in total - and so the redirect code / spider detect etc. will not show any real result until at least next month (if not later, as I only made the changes yesterday). Hopefully, in time, the URLs with SIDs will drop off the Google index and be replaced with nice, fresh, clean ones.

Comment #48

Ryan,

I have no doubt that what you have in place to remove session IDs works perfectly. I'm just saying that the two are very similar, except that the script designed to read the allprods file ALSO looks for IP addresses in addition to the name.

One reason that the allprods "redirect" might not be working is because it is not "redirecting". It is actually using a "read" command which, unless it is a process running on YOUR server, is not going to work. If you change the "read" to a "redirect", it might work, but then Google says that they don't like redirects. I think the better solution is to move the catalog up a level and just put a link to allprods in your category infobox. As witnessed by Google's bombardment this week, they will try to index each and every link on your page. If they find one with lots of good content, they will parse it multiple times.

Ian-san,

I have seen some let-up in the past 24 hours from Google. It appears that the script we implemented is working; it just has to cycle through all of the URLs it listed previously, as you said. I expect it will still be another day or two before they are completely gone. Next month, though, should be a great improvement.

Comment #49

Chris,

Fortunately, I am not getting too much of a performance deterioration (due, I think, to the fact that Google goes to bed when my site is live in Japan), but I really don't want this problem next month.

Your comments have been invaluable, thanks.

Comment #50

Well, my site came back online all by itself, so I can only guess that the bots were just too much. This morning I had 800 online!!!

Since this bot thing started I have had a strange side effect, and I am wondering if you guys feel it is a side effect and not a problem (before I start to change things). The server is in Canada, I am in Ireland, and I can create as many accounts as I like; they always appear in the database and I receive my welcome mail. This also applies to others I have asked to test for me. Now, people in the USA and Canada first said that when they created an account the shop would just keep reloading the create account page. Then, this week, they said they were getting as far as the congratulations page, but they did not get the email and they do not appear in my database.

Do you feel this is related to the server load from the bots, or do any of you recognize this problem, and if so, how do I fix it?

The bots are getting fewer now, and when they're gone I will ask testers to try again; I hope it was the bots. If it was, it is a problem we cannot afford to have every month when the bots are about.

I find it somewhat amazing that I have spent years trying to get search engines to list my sites, and now overnight they have become the curse of my life... :-)

Again, if anyone can post the new version of the code for html_output I would really, really be grateful, so I can change the version I have. I have the version already posted above with the IPs listed.

Comment #51

Terry,

In html_output.php. This is a merge of both solutions (I am a cautious sort of person...), so it includes some duplication:

// Add the session ID when moving from HTTP and HTTPS servers or when SID is defined
if ( (ENABLE_SSL == true) && ($connection == 'SSL') && ($add_session_id == true) ) {
  $sid = tep_session_name() . '=' . tep_session_id();
} elseif ( ($add_session_id == true) && (tep_not_null(SID)) ) {
  $spider_footprint = array("ooglebot", "rawler", "pider", "obot", "eek", "canner", "lurp", "cooter", "rachnoidea", "KIT", "ulliver", "arvest");

  $spider_ip = array("64.209.181.53", "64.208.33.33", "64.209.181.52", "209.185.108.", "209.185.253", "216.239.49.", "216.239.46.", "204.123.", "204.74.103.", "203.108.10.", "195.4.183.", "195.242.46.", "198.3.97.", "204.62.245.", "193.189.227.", "209.1.12.", "204.162.96.", "204.162.98.", "194.121.108.", "128.182.72.", "207.77.91.", "206.79.171.", "207.77.90.", "208.213.76.", "194.124.202.", "193.114.89.", "193.131.74.", "131.84.1.", "208.219.77.", "206.64.113.", "195.186.1.", "195.3.97.", "194.191.121.", "139.175.250.", "209.73.233.", "194.191.121.", "198.49.220.", "204.62.245.", "198.3.99.", "198.2.101.", "204.192.112.", "206.181.238", "208.215.47.", "171.64.75.", "204.162.98.", "204.162.96.", "204.123.9.52", "204.123.2.44", "204.74.103.39", "204.123.9.53", "204.62.245.", "206.64.113.", "204.138.115.", "94.22.130.", "164.195.64.1", "205.181.75.169", "129.170.24.57", "204.162.96.", "204.162.96.", "204.162.98.", "204.162.96.", "207.77.90.", "207.77.91.", "208.200.146.", "204.123.9.20", "204.138.115.", "209.1.32.", "209.1.12.", "192.216.46.49", "192.216.46.31", "192.216.46.30", "203.9.252.2");

  $agent = getenv('HTTP_USER_AGENT');
  $host_ip = getenv('REMOTE_ADDR');
  $is_spider = 0;

  // Is it a spider? First check the user agent against the footprint list.
  $i = 0;
  while ($i < count($spider_footprint)) {
    if (strstr($agent, $spider_footprint[$i])) {
      $is_spider = 1;
      break;
    }
    $i++;
  }

  // If the agent didn't match, check the remote address against the IP list.
  if (!$is_spider) {
    $i = 0;
    while ($i < count($spider_ip)) {
      if (strstr($host_ip, $spider_ip[$i])) {
        $is_spider = 1;
        break;
      }
      $i++;
    }
  }

  // If it's a bot, don't add the session id.
  if ($is_spider) {
    $sid = NULL;
  } else {
    $sid = SID;
  }
}

// Belt and braces - another way to trap the spider.
if (eregi("googlebot", getenv("HTTP_USER_AGENT")) || eregi("internetseer", getenv("HTTP_USER_AGENT")) || eregi("WebCrawler", getenv("HTTP_USER_AGENT"))) {
  $sid = NULL;
}

if (isset($sid)) {
  $link .= $separator . $sid;
}

Comment #52

Ian-San,

There is one thing I don't like about this solution - it costs. Lots of eregis and stuff executing when they are mostly (95%, I guess) not needed.

There must be a better solution. Using your solution on some of those cheapy-cheapy ISPs that some of our community members seem to use will get them into serious trouble.

Nevertheless, the input from this thread is very productive.

Combining our knowledge is one of the core assets of Open Source development. All of you that added to this thread deserve respect, kudos, etc.

Thanks!

Comment #53

Ian,

Thank you very much. I have updated my code now and hope it will help next time around.

The bots have left in the last half hour and everything seems quiet, thank God. Either that or they were bots on the drink and have now passed out!!!

Thank you all for the help, and I look forward to the day that I can return it. After this I am going to try to get above being a PHP gobshite and learn a little, so I can be a not-so-PHP-gobshite.

Thanks again...

Comment #54

I was also concerned about the overhead, but the problems Google gave us during the past few days put this issue second.

In all honesty, I have not noticed any slow-down of my site since adding this code (maybe it was slow already!), but it is hard to tell what a 28.8k dial-up would experience.

Anyway, the cut-down solution mentioned earlier in these posts will probably be okay for those worried about overhead. Hopefully, we can get some feedback on both solutions.

// Add the session ID when moving from HTTP and HTTPS servers or when SID is defined
if ( (ENABLE_SSL == true) && ($connection == 'SSL') && ($add_session_id == true) ) {
  $sid = tep_session_name() . '=' . tep_session_id();
} elseif ( ($add_session_id == true) && (tep_not_null(SID)) ) {
  $sid = SID;
}

// New code to trap spiders
if (eregi("googlebot", getenv("HTTP_USER_AGENT")) || eregi("internetseer", getenv("HTTP_USER_AGENT")) || eregi("WebCrawler", getenv("HTTP_USER_AGENT"))) {
  $sid = NULL;
}

if (isset($sid)) {
  $link .= $separator . $sid;
}

Comment #55

[quote="Jan0815"]There is one thing I don't like about this solution - it costs. Lots of eregis and stuff executing when they are mostly (95% I guess) not needed..

[/quota].

Here's my modified solution, it has run for more than one month, now all googlebot visits don't have any SID anymore. Actually I think this solution can catch most of the robots, without to much overhead..

[quota].

// search engines don't want session id
$spider_footprint = array("bot", "rawler", "pider", "ppie", "rchitext", "aaland", "igout4u", "cho", "ferret", "ulliver" /* list truncated in the original post */);
$spider_ip = array("216.239.46.", "213.10.10.116", "213.10.10.117", "213.10.10.118", "64.41.153.100", "192.134.99.192" /* list truncated in the original post */);

$agent = getenv('HTTP_USER_AGENT');
$host_ip = getenv('REMOTE_ADDR');
$is_spider = 0;

// Is it a spider? Check the user agent (case-insensitively) first.
$i = 0;
while ($i < count($spider_footprint)) {
  if (stristr($agent, $spider_footprint[$i])) {
    $is_spider = 1;
    break;
  }
  $i++;
}

// Fall back to checking the remote address against the IP list.
if (!$is_spider) {
  $i = 0;
  while ($i < count($spider_ip)) {
    if (strstr($host_ip, $spider_ip[$i])) {
      $is_spider = 1;
      break;
    }
    $i++;
  }
}

// if visitor is a spider, then don't attach session id
if ($is_spider) {
  $sid = NULL;
}

Look at the code: you can see that for Googlebot, the first comparison will get a match and thus stop all further comparing. So only the less common robots will cost slightly more time.

By checking your webserver's log, you can adjust the order of $spider_footprint to save some more time.

In my opinion, this hack doesn't add any obvious server load. Save your time to do something more interesting, such as installing PHP Accelerator, which will cut your PHP scripts' running time almost in half.

Comment #56

Sorry, I made some mistakes when I copied/pasted the code in my last post. The spider definition should look like this:

Comment #57

There are many things one can do for site optimization. The site I did, Zoomone.com (an adult DVD shop), was listed straight away.

How did I do it? Simple - good links. Lots of links from popular sites will get you into Google; this is the only way. Meta tags are useless.

Comment #58

Ian, yes, I think that I am in agreement with you.

How can you ensure, though, that the robot goes to the allprods page first without using a redirect? If you use the bot detector in index.php and redirect it to allprods, Google supposedly does not like that. If you are using the "read" command, as I was, it doesn't work, as evidenced by this weekend. I now only have a simple link in the "categories" box to the allprods page, but this doesn't guarantee that the robot will go there first.

The bot detector does work great for determining whether or not to add a SID to the URL, though, so this is worth keeping. We might even want to put this in a function in the functions directory, because I can see needing to use this in other places as well.

Comment #59

If the spider detection does make it into the main code, there are a few other changes that will probably be necessary.

The hit count code will need updating so that it does not increment for a spider hit. Plus, the code that handles the products_viewed count will also need updating to ignore the spider hit.

Maybe the code needs to go in at the beginning of the page load (application_top?), setting a flag to say that it is a spider. This flag can then be checked by the rest of the code. This would also save on processing, as the check would only be performed once for each page, rather than a number of times on each page. A sketch of this idea is shown below.

Jon.
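[Along those lines, a once-per-page check might look something like this - a sketch only; the function and file names are made up for illustration, and the footprint list would need to be as complete as the ones posted above:

// hypothetical helper, e.g. in includes/functions/spiders.php
function tep_is_spider() {
  $agent = getenv('HTTP_USER_AGENT');
  $footprints = array('googlebot', 'slurp', 'crawler', 'spider');
  for ($i = 0; $i < count($footprints); $i++) {
    if (stristr($agent, $footprints[$i])) {
      return true;
    }
  }
  return false;
}

// in application_top.php, once per page load:
$spider_flag = tep_is_spider();

The rest of the code (SID handling, hit counts, products_viewed) could then just test $spider_flag instead of repeating the pattern matching.]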

Comment #60

Oh dear, I never meant to open a can of worms. Sorry if my idea has caused so much chat.

I see by the answers that there is indeed a lot more involved than just placing allprods.php into the root of the site and blocking the catalog folder from the bots.

What I did not know was that the bots have to be able to follow the links they find. I thought they just recorded the links and went home, and that the problem we were having was that for some reason they were looping around our shops, unable to leave.

Anyway, thanks for not sending the guys in white around to take me away today, lol.

Keep up the great work. I will watch this space quietly and stop distracting you all from the breakthrough I know you are all very close to!!

Comment #61

Terry,

Most of us only made the spider mods in the last few days, so it is probably too early to make more big changes. I am confident that the script we have used will stop the mega-multiple listings.

As I see it, Google hits the site 200-300 times; each time it hit, it was getting a new SID, so instead of finding, say, 100 products, it thought it had 100 x 300 = 30,000. But as all the listings have SIDs, they are all pointless.

Now, with the mod, Google should see each of its 200-300 hits as being identical, and so cut the duplication.

I was disappointed to see that my listings on Google have gone up to more than 30,000 again (it went down to 7,500 earlier this month), but I am sure it is just a question of time.

Making more changes to allprods is fine, but we need to be careful about how we do it. Softly, softly, catchee monkey, as they say.

So, for me, nothing more to do on this for now - it's back to my pint of the black stuff...

Comment #62

I think a big lesson learned here is to have a few key items implemented in your site before you go live with it:

1. Remove the SID for bots.

2. Implement robots.txt (I posted a copy of mine in this thread).

3. Search engine safe URLs (not as important as it used to be, but still worth it).

4. ?? anything else ??

If these are implemented correctly, Google should not go crazy on your site. It will only have to index your site a small number of times instead of getting stuck in a SID loop. Also, the next time Google comes back, it will never have cached your SID URLs, so it will never try to hit those URLs.

Block bots from going into areas where they do not need to go by using robots.txt: log in, shopping cart, the checkout process, etc.
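[For illustration, a minimal robots.txt along those lines might look like this - an example only, assuming the shop lives under /catalog/; adjust the paths to your own install:

User-agent: *
Disallow: /catalog/login.php
Disallow: /catalog/create_account.php
Disallow: /catalog/shopping_cart.php
Disallow: /catalog/checkout_shipping.php
Disallow: /catalog/checkout_payment.php
Disallow: /catalog/account.php

The file goes in the web root, and well-behaved bots read it before crawling anything else.]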

This has been a very informative thread, and I hope that some of the key items can be included in the CVS tree for everyone to use.

Ryan..

Comment #63

Could someone who has followed all 11 pages of this topic closely please post a synopsis of its conclusions?

I've tried following from the start, but opinions keep changing and I'm getting confused.

Many thanks.

Comment #64

Daniel,

Take your choice of one of the 3 versions of the spider-catching mod to html_output shown on page 8 (big, small, or medium), plus add a robots.txt file to your URL root directory - see the example on page 9 (although you may want to leave out the disallow for the files that show products, e.g. product_info). Then add the allprods contribution to your site.

That about does it, I think, for now?

Comment #65

Yes, this change disables the session ID, but then a customer cannot add an item to the basket without logging in.

So it has a side effect that you don't want. :?

Joseph.

Comment #66

I just had a bot on my site. It came, it saw, it went, and did not even get a session id. It went to my allprods page and followed all the links. I use a simple text link in the information box to send it to the allprods page.

I used everything that was mentioned in this thread and it seems to be working. I use the user tracking mod to see what's happening on my server; when I called up the stats it nearly killed my server, but as I said, in the user tracking there were no session ids to be seen.

I used the .htaccess that was mentioned, the kill-session-ids-for-the-bot code, and a simple text link in the infobox to allprods.

Comment #67

HELP...

I've looked everywhere but cannot find the answer.

I've found the code to add to html_output on page 8 (the all-inclusive one).

Where on earth do I paste this in my html_output.php file? And is there any code I must remove?

Also, is that all I must do, apart from adding a robots.txt?

Thank you for your help... I'm not a PHP programmer at all!

Best Regards,

James Anthony

www.gadgetworldonline.com

Comment #68

James,

I posted the before and after to your mailbox so you can see how to do it.

Comment #69

I have set all the coding up to optimize my site for search engines, etc.

I have put the new code into html_output.php, installed the allprods.php add-on and linked to it on the first page, and have put in the robots.txt that was pasted earlier in this forum.

Could someone be so kind as to check out my page and let me know if I have done everything ok, so that the search engines will start ranking me?

My site is www.gadgetworldonline.com

Thank you very much for any help.

Also, how can I tell if a Google bot is examining the pages, or whether they are real hits?

Cheers,

Best Regards,

James Anthony

www.gadgetworldonline.com

Comment #70

I propose that this POST should be moved to the "Development Room".

Could someone add a generic "robots.txt" to the Contributions section, like it was done with the generic ".htaccess"?

And the code for "html_output" could also be contributed the same way.

We would all appreciate it, since it will be there forever, whereas this topic will vanish from sight only a few days after people stop posting.

Best, and thanks to you all,

Lopo

Comment #71

There was no check for inktomi in the code. I would put this spider second to Google. It just hit my site:

Default Server: resolver0.dial.pipex.net
Address: 158.43.240.4

> 62.253.96.5
Server: resolver0.dial.pipex.net
Address: 158.43.240.4
Name: inktomi2-win.server.ntl.com
Address: 62.253.96.5

You should definitely include a check for "inktomi" in the mod.

Comment #72

I'm confused???

I'm not too good at coding... I put in the massive load of code in html_output.php etc. that was displayed earlier in this thread.

But 2 messages up from here there is a link (is it another method?).

Should I replace what I had previously in html_output, and make the changes it says in the link above?

I'm very confused!

Thanks,

Anthony

www.gadgetworldonline.com

Comment #73

Looks to me like one of NTL's Inktomi web caches rather than an Inktomi spider.

Comment #74

I can't believe it... Google hit my site yesterday and behaved itself.

I now have a problem though...

I'm currently using the html_output, robots, and allprods solution, and it works. There were no session ids and it looked at the right pages. They aren't showing on Google yet, but hopefully they will soon. Though if Google doesn't like the pages, that is another issue (meta tags are next).

So, do I stick with this, or go with Ian's contribution?

Being a firm believer in 'if it ain't broke, don't try to fix it', I think I will leave things as they are, bar a bit more tweaking of the robots file and the pattern matching for the spiders.

My thanks to everyone that helped with this.

Jon.

Comment #75

Hi.

I've done a little research and I want you all to know this.

Googlebot's behaviour, as I've seen it:

Every three weeks, count on getting a Googlebot visit. It will crawl your site with a depth related to your PageRank. It will display the results of the crawl in about 24-48 hours.

After that, it will visit your site very frequently in the following days. If it can't crawl the whole site within x days, it won't update its database. It will continue crawling your site for days, maybe weeks.

Almost every time it crawls your site you'll see the results in 24 hours. And in 48 hours you might not see them again. This is normal; there's nothing wrong. It's the, and I quote, "Google Dance".

So, if Googlebot crawled on Tuesday and the results were displayed on Wednesday, it's probable that on Thursday the new results are gone.

Just FYI, so that you won't :shock: when you don't see your site on Google.

If you don't get listed at all, try submitting your site at dmoz.org (do READ the guidelines before submitting).

Thanks. :wink:

Comment #76

There are 3 solutions in this thread, plus the new one posted by Ian C four days ago. All have come about in just a couple of weeks, so it is too early to know which will work best. Ian C's solution works in a different way to the bot catchers shown in this thread, so it may throw up new problems, although some of us have tested it without problems - but that is not the same as saying that it will get rid of the 30,000 hits I had last month from Google.

Still, on the whole, I am going with the Ian C solution. But the real test will be in two weeks, when I am hit by Google again.

So, Anthony, maybe you have to ask that question again in three weeks...

Comment #77


This question was taken from a support group/message board and re-posted here so others can learn from it.