Google Indexing of Pages Without Inlinks

I believe I can prove that Google is indexing pages that have no inbound links, which means it is discovering those pages by some means other than a typical site crawl. While I can’t prove what they are doing to discover URLs outside of a typical site crawl, I can prove that they are doing it.

Review this query:

http://www.google.com/search?hl=en&q=site%3Aorigin-www.baltimoresun.com&pws=0

And as of today (January 9, 2009) there are 16,200 pages indexed on the origin-www.baltimoresun.com subdomain.

Reviewing this query:

http://www.google.com/search?hl=en&q=link%3Aorigin-www.baltimoresun.com&pws=0

Google states that there are inbound links from these pages to origin-www.baltimoresun.com, right? Well, let’s look at the source of the cached versions of those pages and see where the link is.

It doesn’t exist.
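For anyone who wants to repeat that check without reading page source by eye, here is a minimal sketch of the idea in Python. It is not the process I used (I checked the cached pages by hand); the `links_to_host` helper, the sample markup, and the example.com page URL are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_to_host(html, page_url, target_host):
    """Return the absolute links in `html` whose hostname is `target_host`."""
    collector = LinkCollector()
    collector.feed(html)
    # Resolve relative hrefs against the page they were found on.
    absolute = (urljoin(page_url, href) for href in collector.hrefs)
    return [url for url in absolute if urlparse(url).hostname == target_host]


# Hypothetical stand-in for a cached page; not real markup from any site.
sample_html = """
<html><body>
  <a href="http://www.baltimoresun.com/news/">News</a>
  <a href="/sports/">Sports</a>
</body></html>
"""

found = links_to_host(
    sample_html,
    "http://www.example.com/some-page.html",  # assumed URL of the linking page
    "origin-www.baltimoresun.com",
)
print(found)  # an empty list means no link to the subdomain was found
```

An empty result for every page that the ‘link:’ query returns would match what I saw by hand.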

 

Okay . . . but we all know that Google’s ‘link:’ operator doesn’t work well, so let’s look at Yahoo’s Site Explorer instead, which is considered reliable in the search industry.

http://siteexplorer.search.yahoo.com/search?p=http%3A%2F%2Forigin-www.baltimoresun.com&bwm=i&bwmo=d&bwmf=u

This query shows that no pages outside the subdomain or domain link to the site. (There will be some links from within the subdomain, of course, because relative URLs resolve against it and the spider finds them once it is crawling there.)
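To illustrate the relative URL point, here is a small Python sketch of how a crawler would resolve links found on a page served from the origin-www host. The page path and the hrefs are hypothetical; only the hostnames come from this post.

```python
from urllib.parse import urljoin

# Hypothetical page on the internal subdomain.
base = "http://origin-www.baltimoresun.com/news/index.html"

# Two relative hrefs and one absolute href, as they might appear in a template.
hrefs = ["/sports/", "local/story.html", "http://www.baltimoresun.com/"]

# Relative hrefs inherit the host of the page they appear on.
resolved = [urljoin(base, href) for href in hrefs]

for href, url in zip(hrefs, resolved):
    print(href, "->", url)
```

The two relative links stay on origin-www.baltimoresun.com, so once a spider lands on a single page of the subdomain it can find the rest without any outside links.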

What’s really interesting is that out of the multiple Tribune domains that have an origin-www subdomain indexed (8 domains have this subdomain), the only one that Yahoo found was the baltimoresun.com version. Furthermore, these subdomains have been live for over a full year, but I only noticed them over the past few days (and can prove a few have been around for at least a month; shame on me, I should’ve caught this sooner perhaps). This tells me that this is a fairly recent change by Google and possibly Yahoo! (though I think, for other reasons, that Yahoo is just crawling Google’s search results).

Here are some of my theories as to what Google may be doing:

  1. Google Toolbar tracking - Obviously several Tribune employees who hit this subdomain, which is intended for internal use, have the Google Toolbar installed.
  2. Google Personalization - Whether it is by browsing history, cookies, etc., I’m not sure, but several Tribune employees have Google accounts.
  3. GTalk - Several Tribune employees use Google’s GTalk feature and we send links to these subdomains around through GTalk. Perhaps Google is tracking GTalk URLs for discoverability.
  4. Gmail - We have a lot of dedicated employees at Tribune. Perhaps one of them used a personal email address when working from home to send a link from this subdomain?
  5. ???

What do you think? What could’ve caused this problem?

Also . . . seriously, the duplicate content filter didn’t catch this? Why not? You’d think that with discovery methods such as these, that’d be the first thing to check.

Note: The only difference between the normal subdomain and the origin-www.baltimoresun.com subdomain is a server configuration. There is nothing public facing that shares any proprietary information. We only kept it ‘internal’ to prevent this exact problem (creating duplicate content). Now that it has happened anyway, there is no issue with us sharing it publicly (especially considering that all of the origin-www subdomains, origin-www.baltimoresun.com etc., will be removed via robots.txt early next week).
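For reference, here is a sketch of what a blanket robots.txt block looks like and how a crawler that honors it would read it. The rules below are an assumption about the simplest form such a file could take, not the file that will actually be deployed, and the URL path is hypothetical. The check uses Python’s standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules: block every crawler from the whole host.
robots_txt = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

url = "http://origin-www.baltimoresun.com/news/"  # hypothetical path
allowed = parser.can_fetch("Googlebot", url)
print(allowed)  # False: a compliant crawler should not fetch this URL
```

This only shows how the rules are interpreted; it says nothing about how quickly already indexed pages drop out of the results.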

Disclaimer: Now that this post exists, some inbound links may develop to the origin-www subdomains, but at the time of this post I went through over 20 of the Google ‘link:’ results and checked the cached pages. None linked to the subdomain.

Comments

Comment from Dave
Time: January 9, 2009, 7:24 pm

Do the subdomains have DNS records that might be accessible?

@dsegrove

Comment from Cecily Crout
Time: January 9, 2009, 8:05 pm

Interesting stuff Brent.

My vote’s for the Google Toolbar. The other possibilities you describe seem unethical to me; I’d like to think Google’s not crawling links in personal browsing history, GTalk or Gmail.

However, I had a sub-domain indexed by Google recently that had no incoming links and had a robots.txt set to ‘all robots disallow’. The Google Toolbar seems the likely culprit, since I’ve seen Alexa find non-linked sub-domains this way, but the fact that they ignored the robots.txt is disturbing.

Comment from Dave
Time: January 9, 2009, 8:32 pm

Hey there.. tried to snag ya via Twitter, sadly we are not acquainted. I have a suspicion it is much as you were thinking, application focus - see : http://bit.ly/qBVb

It can be used for discovery… over various platforms (have a read, all makes sense)

Hope that helps some…. seems a likely direction to look.

Comment from Stuart Livesey
Time: January 9, 2009, 10:10 pm

Brent

I’m sure that Google has the ability to find unlinked domains in a number of ways.

As long as 6 to 9 months ago, several of us here at TWM were working on a new website for a client and emailing each other (we’re a distributed business) with thoughts, ideas and changes. The URL went backwards and forwards a number of times.

At the time I was a bit surprised to find that Google had indexed the temporary home page about a week before it was replaced with the full-blown site.

Several years ago now, Matt Cutts stood up at a conference and revealed that Google could track domains via some method that he never quite got round to explaining … but it was probably by looking at what domains are on the same IP addresses.

And I’m sure Google has several other ways of finding unlinked sites. So don’t be surprised that Google found your site - we’ve been working that knowledge into our search engine strategy for new sites ever since we stumbled on it with the site that we had been emailing to one another.

Stuart

Comment from Stuart Livesey
Time: January 10, 2009, 2:36 am

Cecily, take a few moments to read Gmail’s TOS and Privacy Statement. You may be surprised to discover what isn’t in there.

Stuart

Comment from Ganesh J. Acharya
Time: January 10, 2009, 2:48 am

the main domain is indexed a lot from external pages. http://bit.ly/Eo1d

Comment from James Morell
Time: January 10, 2009, 2:48 am

Brent,

I’m running an experiment with a colleague at the moment to see if Google indexes sites that don’t have any external links better than those which do. It appears that even with very, very few or no external links you’ll be indexed, and can even rank pretty well for very targeted keywords.

I don’t know if origin-www. has been submitted to GWT, but it has a sitemap at: http://origin-www.baltimoresun.com/sitemap.xml which is a pretty good way for G to find out the structure of the site even if it’s not how it found out about the site.

James

Comment from Bashar
Time: January 10, 2009, 5:58 am

I would say Google Toolbar, Personalization, and Gmail!

Comment from Kevin Mullett
Time: January 10, 2009, 8:44 am

This video (1:18 in) explains one method they use quite often. The whole video really explains the Goog policy. Googlebot will find a way to find the fresh content.

video > http://tinyurl.com/yomhcv

I am following up my tweet with this comment.

Comment from Dave
Time: January 10, 2009, 11:45 am

I took a look at the Toolbar TOS/Privacy Policy

Here’s the link http://www.google.com/support/toolbar/bin/static.py?page=privacy.html&hl=&v=

Not exactly sure what this means, but read on…

Certain optional Toolbar features operate by sending Google the addresses or other information about sites when you visit them. Web History, PageRank, and Safe Browsing in Enhanced Mode all work this way. In addition, if you use Safe Browsing, when Google warns you about a suspicious site we may also log that site’s URL and whether you accepted, rejected, or closed the warning message. We will let you know when you are enabling a feature that automatically sends page addresses to Google, and you can turn these features off at any time.

Comment from Cecily Crout
Time: January 10, 2009, 12:47 pm

Thanks for the heads up Stuart. Guess I was wrong about Google.

I’ve been using Google Docs for a while (and have now started with Apps) and I’ve been toying with the idea of using Gmail, but this is a deal breaker for me.

Googled the ‘Google TOS’ and #2 was this http://www.webrampage.com/gmails-tos-is-freaky-unfriendly/

It’s no wonder they offer so many services for free, or next to nothing. Information is priceless.

‘knowledge is power, knowledge is riches…’
