Missing redirection

Sirius

Guest
Hi,

Recently I noticed that Google has indexed pseudo-subdomains of Sott.net. There is no redirection set up, so you end up with duplicate content, etc., which is bad and can be solved with a single line of code.
But see for yourself: https://www.google.de/#hl=en&q=site:www.de.sott.net&oq=site:www.de.sott.net

Related topic: http://cassiopaea.org/forum/index.php?topic=27149.0
 
If you redirect those pages with “301 Moved Permanently”, Google will recognise the change automatically and drop the old URLs from its search index. This is the recommended approach.
You should also specify a canonical URL for each page; see: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=139066
 
Well... The pages haven't moved. They never should have been there in the first place. Google will catch on anyway, which is good because otherwise we'd have to differentiate between all possible valid links (and return the 301) and invalid links. Otherwise we'd end up with "moved permanently to 404 - not found".

It appears this was also screwing up the RSS feeds.

So, anybody with a malfunctioning RSS feed should check the feed URL and make sure it's not something like:

mail.es.sott.net

or

www.de.sott.net

Crikey...
 
What I meant was to use htaccess redirection like this:
Code:
RewriteCond %{HTTP_HOST} ^www.de.sott.net$  [NC]
RewriteRule ^(.*)$  http://de.sott.net/$1  [R=301,L]
You can then extend it with multiple subdomains, etc.

You also need to generate a canonical URL for each page (very important).
 
Oh! I see...

Thanks for that bit of Apache redirect stuff... That stuff always makes me cry because it never quite works the way I want it to.

I did a little test when making the 404 error work, and I found out that in about 5 minutes, 245 hits to the SOTT server had malformed URLs, like:

http://blocking.azw.sott.net/articles/show/123456

???

I also put the canonical URL link tags for articles and pages. I had always wondered what those were for, and now I know. In all the SEO reading I've done (like half of Google's content), nobody ever mentioned the canonical URL thing! Sheesh.

Thanks2

:flowers:
 
Much more went wrong:
https://www.google.de/webhp#q=inurl:*.sott.net+-inurl:de.sott+-inurl:www.sott+-inurl:es.sott+-inurl:fr.sott+-inurl:facebook.com
You need much more aggressive filtering in order to fix this (those are usually valid URLs; they are prefixed wrongly though).
1) Collect all real or valid subdomains first, e.g. de, fr, es, www, mail, etc.
2) Redirect anything from *.sott.net to sott.net except those exceptions above.

Mr. Scott said:
I had always wondered what those were for, and now I know. In all the SEO reading I've done (like half of Google's content), nobody ever mentioned the canonical URL thing! Sheesh.
It's documented in Google's webmaster section. Above I posted a link to it.
 
Sirius said:
Much more went wrong:
https://www.google.de/webhp#q=inurl:*.sott.net+-inurl:de.sott+-inurl:www.sott+-inurl:es.sott+-inurl:fr.sott+-inurl:facebook.com
You need much more aggressive filtering in order to fix this (those are usually valid URLs; they are prefixed wrongly though).
1) Collect all real or valid subdomains first, e.g. de, fr, es, www, mail, etc.
2) Redirect anything from *.sott.net to sott.net except those exceptions above.

Oh dear...

Well, I improved the regex quite a bit, but I haven't gotten the above to work yet. I need a negative lookahead, I think, but of course it doesn't work.

Oh well, there's always tomorrow!
 
On the other hand, there's also right now.

I let apache handle the basic cases with my expanded regex, and then for the "everything else -> www.sott.net", I changed the 404 redirect in the SOTT app itself to a 301 redirect to www.sott.net/BLAH.

That will take care of the "_www.dildomania-info.sott.net_" links on Google, with a proper 301 redirect to the real article, AND it has the canonical link in the page.

Not ideal since mod_rewrite is faster, but it works.

Now I can sleep soundly.

:zzz:
 
One last thought: At first, I thought this was Google doing its "I'll try a URL that I know is wrong to see if I get the proper 404 error message".

But after you posted that Google link above, Sirius, I'm thinking somebody has been doing "reverse site promotion" for SOTT, trying really hard to get us dropped in the rankings.

Hmm.
 
There are not necessarily bad guys out there doing it deliberately. As long as it is possible to use arbitrary subdomains, it will happen somehow.

For example, suppose there is a list with two domains somewhere in the index of some search engine:
domain.com
www.sott.net
Then a line break occurs and you have domain.comsott.net which I have seen (another domain name of course).

Someone posts a URL in obfuscated form like ww w.sott.net / article … and you get the problem.
For example here: _http://www.youtube.com/watch?v=dZk0ZGNHkoQ
There is a user comment (obviously written with a malfunctioning keyboard):
comment said:
ht tp:/ /cryptome. org/eyeball/daiichi-npp15/daii­chi-photos15.htm Cryptome.org has info that everyone imo should know + ht tp://w w w.sott.net/articles/show/22893­3-HAARP-and-The-Canary-in-the-­Mine starts in 2004 to present day. Dunno if yt purged my previous comment or you did. If you I appreciate the protection 4 me effort. Shared sott.net also with henning. The rest of this with you.  TC.
And so on …
On the other hand, if you look at constructions like zihggbu.sott.net or xxx.sott.net or 66.sott.net, one might really wonder what actually happened.

Try this one for redirection (add subdomains as you wish):
Code:
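# Subdomain handling
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteBase /
# Junk prefixed to a valid subdomain (e.g. www.de.sott.net): strip the prefix.
# The \. before the group keeps hosts like code.sott.net from matching "de".
RewriteCond %{HTTP_HOST} ^.+\.(www|de|es|fr|mail)\.sott\.net$ [NC]
RewriteRule ^(.*)$ http://%1.sott.net/$1 [R=301,L]
# Any other host that is not exactly a valid subdomain goes to www.
RewriteCond %{HTTP_HOST} !^(www|de|es|fr|mail)\.sott\.net$ [NC]
RewriteRule ^(.*)$ http://www.sott.net/$1 [R=301,L]
</IfModule>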
Don't use the other code, the dots are not escaped there. It was only an illustration.
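
If you want to check which hosts end up where before touching the live config, the two conditions can be mirrored in a few lines of Python. This is only a rough sketch, not what Apache actually runs: the function name redirect_host is made up for illustration, and the pattern here requires a literal dot in front of the valid subdomain so that something like code.sott.net is not mistaken for a prefixed de.sott.net.

```python
import re

# Real subdomains, taken from the rewrite rules above (extend as needed).
VALID = ("www", "de", "es", "fr", "mail")
_ALT = "|".join(VALID)

# First condition: junk prefixed to a valid subdomain, e.g. www.de.sott.net.
PREFIXED = re.compile(rf"^.+\.({_ALT})\.sott\.net$", re.IGNORECASE)
# Second condition (negated in the config): exactly a valid host.
EXACT = re.compile(rf"^({_ALT})\.sott\.net$", re.IGNORECASE)


def redirect_host(host):
    """Return the host a request should be 301-redirected to, or None."""
    match = PREFIXED.match(host)
    if match:
        # Keep the valid subdomain, drop whatever was prefixed to it.
        return f"{match.group(1).lower()}.sott.net"
    if not EXACT.match(host):
        # Anything else that is not a known subdomain goes to www.
        return "www.sott.net"
    return None  # already a valid host, no redirect


for host in ("www.de.sott.net", "zihggbu.sott.net", "de.sott.net"):
    print(host, "->", redirect_host(host))
```

Feeding it the odd hosts from the Google results is a quick way to spot a pattern that catches too much or too little.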

You should also add to your robots.txt file certain pages which are not intended for search engines. Append:
Code:
User-agent: *
Disallow: /users/login
Disallow: /users/signup
Add further pages if necessary.
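
To double-check that the rules block what you expect, Python's standard urllib.robotparser can read the same lines. A small sketch, assuming the two Disallow rules above are the only ones in the file; the helper name is_crawlable is just for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from the robots.txt snippet above.
RULES = """\
User-agent: *
Disallow: /users/login
Disallow: /users/signup
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def is_crawlable(path, agent="*"):
    """True if the rules allow the given agent to fetch the path."""
    return parser.can_fetch(agent, "http://www.sott.net" + path)


for path in ("/users/login", "/users/signup", "/articles/show/123456"):
    print(path, is_crawlable(path))
```

Note that Disallow works as a path prefix, so /users/login/ with a trailing slash is covered by the same rule.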

Another SEO issue is archive pages like http://www.sott.net/signs/archive/en/2012/signs20120906.htm
They may be visited by search engines but should not be indexed. They displace article pages and actual content. Such pages should be configured with:
Code:
<meta name="robots" content="noindex, follow">
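
For archive pages that already exist as static files, the tag has to be added after the fact. One possible way to do that is sketched below in Python. It assumes the archive is a tree of UTF-8 .htm files with a normal <head> tag, which may not match the real layout; add_noindex and tag_archive are hypothetical names.

```python
import re
from pathlib import Path

META = '<meta name="robots" content="noindex, follow">'
# Matches <head> or <head ...>, but not <header>.
_HEAD = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)
# Detects an existing robots meta tag so the file is not tagged twice.
_ROBOTS = re.compile(r"<meta[^>]+name=[\"']robots[\"']", re.IGNORECASE)


def add_noindex(html):
    """Insert the robots meta tag right after <head>, unless one exists."""
    if _ROBOTS.search(html):
        return html
    return _HEAD.sub(lambda m: m.group(0) + "\n" + META, html, count=1)


def tag_archive(root):
    """Rewrite every .htm file under root in place (assumed layout)."""
    for path in Path(root).rglob("*.htm"):
        text = path.read_text(encoding="utf-8")  # assumption: UTF-8 files
        updated = add_noindex(text)
        if updated != text:
            path.write_text(updated, encoding="utf-8")
```

Running add_noindex twice on the same page leaves it unchanged, so the script can be re-run safely. Try it on a copy of the archive first.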

Redirection is also missing for trailing slashes. For example:
users/login and users/login/ coexist! This is also where canonical comes in handy.
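
To illustrate the idea: pick one form of each URL and emit it both as the redirect target and in the canonical link. Below is a minimal Python sketch that settles on the version without the trailing slash. That choice is an assumption, the opposite works just as well as long as it is consistent, and the function names are made up for the example.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Return one preferred form: lowercase host, no trailing slash, no fragment."""
    parts = urlsplit(url)
    # Strip trailing slashes, but keep "/" for the site root.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


def canonical_link_tag(url):
    """Build the <link rel="canonical"> element for the page's <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}">'


print(canonical_link_tag("http://www.sott.net/users/login/"))
```

Both users/login and users/login/ then point to the same canonical address, and a request for the non-canonical form can be answered with a 301 to it.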
 
Sirius said:
There are not necessarily bad guys out there doing it deliberately. As long as it is possible to use arbitrary subdomains, it will happen somehow.

For example, suppose there is a list with two domains somewhere in the index of some search engine:
domain.com
www.sott.net
Then a line break occurs and you have domain.comsott.net which I have seen (another domain name of course).

That makes sense.

Sirius said:
On the other hand, if you look at constructions like zihggbu.sott.net or xxx.sott.net or 66.sott.net, one might really wonder what actually happened.

Yeah, that's what I am wondering about. That and the more racy links!

Sirius said:
Try this one for redirection (add subdomains as you wish):

:headbash:

So simple! It always is, once you see how to do it. Made a few tweaks, and it ended up being way shorter than my solution last night.

Sirius said:
You should also add to your robots.txt file certain pages which are not intended for search engines.

Okay, I added them.

Sirius said:
Another SEO issue is archive pages like http://www.sott.net/signs/archive/en/2012/signs20120906.htm
They may be visited by search engines but should not be indexed. They displace article pages and actual content.

DOH! I went in and added the meta tags recursively to the existing files, and all new ones from here on out will have the meta tag added.

Sirius said:
Redirection is also missing for trailing slashes. For example:
users/login and users/login/ coexist! This is also where canonical comes in handy.

Okay, I fixed the trailing slash, too. Hmm... Okay, I also added the canonical URL for some more pages on the site.

Geez... All this work just to make Google happy.

Thanks again for your help!
 