Google no-shows

Matt Webb, remarking on Google’s purchase of Pyra, hence of Blogger:

Metadata-for-free is an important one. Imagine Google noticing that the HTTP_REFERER is a particular article. If they haven’t got that in Google News, it needs scraping. If there are a lot of referrers, that’s important too.

Here is part of my experience with Google, which I am having trouble understanding.

If I enter a search term that is not actually found, it seems to stay unfound forever.
For ordinary text entered as a search term, I don’t see what Google could do to fix the problem, except maybe store up the search terms and rerun them every three or four weeks (as the entire database is refreshed). But all that would do is ascertain whether previous searches now have any results, which isn’t very useful.
For a URL entered as a search term, though, that should be a trigger for the Googlebot to immediately download that URL. Links are very important to the Google PageRank. The fact that I entered a URL into the search box is a significant link.
The foregoing remains true with file formats other than HTML, particularly PDF.

For a few weeks, I’ve been keeping track of Google searches that come up blank. (Why so little time? If I don’t write this now, I’ll never get around to it. How do I keep track? By mailing myself the results in Lynx, an incalculably convenient technique.) Today I re-checked those links, and they all still come up blank:

Naturally, my including them on this page may be enough to get them indexed.

Solution

This one’s easy. Google should fetch any URL that comes up with no search results.
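
As a rough sketch of that proposal (an illustration only, not a description of how Google works), the logic might look something like the following. The names `looks_like_url`, `handle_query`, and `crawl_queue` are invented for the example, and a plain dictionary stands in for the index.

```python
from collections import deque
from urllib.parse import urlparse


def looks_like_url(query: str) -> bool:
    """Return True if the query parses as a web address with a dotted host name."""
    candidate = query.strip()
    if not candidate or " " in candidate:
        return False
    if "://" not in candidate:
        candidate = "http://" + candidate  # assume http when the scheme is omitted
    try:
        parsed = urlparse(candidate)
    except ValueError:
        return False
    return parsed.scheme in ("http", "https") and "." in parsed.netloc


def handle_query(query: str, index: dict, crawl_queue: deque) -> list:
    """Look the query up; if a URL comes back empty, queue it for fetching."""
    results = index.get(query, [])
    if not results and looks_like_url(query):
        crawl_queue.append(query)  # hypothetical hand-off to the crawler
    return results


# A stand-in index with one known entry (placeholder data).
index = {"example": ["http://example.com/"]}
crawl_queue = deque()

handle_query("www.teleinquiry.govt.nz/reports/final/final.pdf", index, crawl_queue)
handle_query("some phrase nobody has written", index, crawl_queue)
print(list(crawl_queue))
```

Only the URL ends up in the queue. The ordinary-text query is left alone, for the reason given earlier: rerunning it later would tell you very little.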

Issues with PDF

Google indexes PDFs and many other document types. In certain cases, Google results let you read the contents of the original file transcoded into HTML. But if you’re using a text-only browser, Google will also give you a link to a text-only transcoding of a PDF.

A PDF author can disallow indexing, “content extraction,” or reflow. Doing so also breaks accessibility support – in 40-bit-encryption mode, at least. (It says so right there in the dialogue box: “No Content Copying or Extraction, Disable Accessibility,” which is perhaps not the best turn of phrase. 128-bit encryption is no problem: “Enable Content Access for the Visually Impaired” and “Allow Content Copying and Extraction” are separate switches turned on by default.)
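
The difference between the two encryption modes comes down to how the permission flags in a PDF’s encryption dictionary are interpreted. The sketch below follows the user access permissions table in Adobe’s PDF Reference as best understood here: bit 5 governs copying and extraction, and bit 10 (honoured only from security handler revision 3, the 128-bit mode) governs extraction for accessibility. The function name and return shape are invented for illustration, and the bit positions are worth checking against the specification before relying on them.

```python
COPY_EXTRACT_BIT = 1 << 4           # bit 5 (the spec numbers bits from 1)
ACCESSIBILITY_EXTRACT_BIT = 1 << 9  # bit 10, meaningful from revision 3 on


def extraction_permissions(p_flags: int, revision: int) -> dict:
    """Decode copy and accessibility permissions from a /P value and /R revision."""
    can_copy = bool(p_flags & COPY_EXTRACT_BIT)
    if revision >= 3:
        can_access = bool(p_flags & ACCESSIBILITY_EXTRACT_BIT)
    else:
        # Revision 2 (40-bit): one bit covers both, so blocking copying
        # also blocks assistive technology.
        can_access = can_copy
    return {"copy": can_copy, "accessibility": can_access}


# Copying disabled, accessibility bit set.
flags = ACCESSIBILITY_EXTRACT_BIT
print(extraction_permissions(flags, revision=2))
print(extraction_permissions(flags, revision=3))
```

With the same flags, the 40-bit reading denies both permissions while the 128-bit reading still allows accessibility, which matches the wording of the two dialogue boxes.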

If you use Google to search for a URL of a PDF document with content extraction turned off, a graphical browser will return a result like this:

[PDF]Ministerial Inquiry into Telecommunications - Final Report
File Format: PDF/Adobe Acrobat
Page 1. Ministerial Inquiry into Telecommunications Final Report 27
September 2000 Page 2. Ministerial Inquiry into Telecommunications ... 

Google can show you the following information for this URL:

* Find web pages that are similar to www.teleinquiry.govt.nz/reports/final/final.pdf 
* Find web pages that link to www.teleinquiry.govt.nz/reports/final/final.pdf 
* Find web pages that contain the term "www.teleinquiry.govt.nz/reports/final/final.pdf"

In other words, Google gives you no option to view the PDF in HTML. Google honours the no-extract setting of the original.

If, however, you do the same search in a text-only browser, the results are as follows:

Lynx

[PDF][9]Ministerial Inquiry into Telecommunications - Final Report
File Format: PDF/Adobe Acrobat
Your browser may not have a PDF reader available. Google recommends
visiting our [10]text version of this document.
Page 1. Ministerial Inquiry into Telecommunications Final Report 27
September 2000 Page 2. Ministerial Inquiry into Telecommunications ...

Google can show you the following information for this URL:
 * Find web pages that are [11]similar to
   www.teleinquiry.govt.nz/reports/final/final.pdf
 * Find web pages that [12]link to
   www.teleinquiry.govt.nz/reports/final/final.pdf
 * Find web pages that [13]contain the term
   "www.teleinquiry.govt.nz/reports/final/final.pdf"

Google attempts to give you a text-only version of the PDF. In fact, Google always does so for a text-only browser.

Lynx numbers your hyperlinks for you, which is quite the convenience. Hyperlink 10 above leads to http://www.google.com/webhp?hl=en, an alternate of the Google homepage. Lynx will not load that page; at least, it appears to reload the original search page instead.

Links

The confusingly-named Links browser is another text-only device. Its results look the same, really:

[PDF][11]Ministerial Inquiry into Telecommunications - Final Report
File Format: PDF/Adobe Acrobat
Your browser may not have a PDF reader available. Google recommends
visiting our [12]text version of this document.
Page 1. Ministerial Inquiry into Telecommunications Final Report 27
September 2000 Page 2. Ministerial Inquiry into Telecommunications ...

Google can show you the following information for this URL:
 * Find web pages that are [13]similar to
   www.teleinquiry.govt.nz/reports/final/final.pdf
 * Find web pages that [14]link to
   www.teleinquiry.govt.nz/reports/final/final.pdf
 * Find web pages that [15]contain the term
   "www.teleinquiry.govt.nz/reports/final/final.pdf"

Links also numbers your hyperlinks for you. Hyperlink 12 above leads to http://www.google.com/ but redirects to http://www.google.ca/. (Here Google sends me to the Canadian site, which it never, ever does otherwise, something I am grateful for.)

The search results are not the same for Lynx and Links, inexplicably.

W3M

W3M is orders of magnitude more obscure even than Links and does exactly the same thing as Links.

Solution

First of all, the option to view a PDF in text only is much more useful in a graphical browser than a text-only one. Whether the source is HTML or plain text, the only possible result in a text-only browser is plain text. In a graphical browser, it is often tedious to watch the browser attempt to render massively complicated transcoded HTML. The option for text-only viewing should be available to all browsers.

However, when a PDF author turns off content extraction, Google should not even show a link to a text-only version. There is no such version, and the implementation of that feature is itself broken.
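
Put together, the two rules proposed here amount to a very small decision, sketched below. This illustrates the suggestion, not Google’s actual code; `result_links` and its parameters are hypothetical names.

```python
def result_links(is_pdf: bool, extraction_allowed: bool) -> list:
    """Return the links a search result should carry under the proposed policy."""
    links = ["original"]
    if is_pdf and extraction_allowed:
        # Offered regardless of browser: graphical users benefit most.
        links += ["view as HTML", "text version"]
    # When the author has disabled extraction, no transcoded link is shown,
    # because no transcoded version exists to link to.
    return links


print(result_links(is_pdf=True, extraction_allowed=True))
print(result_links(is_pdf=True, extraction_allowed=False))
```

Note that the function takes no browser argument at all. Under this proposal the kind of browser making the request would not change which links appear.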

Google tracks your searches

Yes, I read the Orwellian exposé of Google’s suspected practices. Of course they’re not telling us everything they’re doing with our searches and the information they learn from the cookies installed on our machines. Whether or not that information is used perniciously is another question, but they’re not telling us everything.

But there are further advantages to using text-only browsers, as it turns out. It is a snap to see the URL that a link leads you to. Just place the cursor on that link (in Lynx, type the number plus g, as 12g) and look at the status line. If the link is too long for the line, press = and you’ll see the whole thing on a separate screen. I am aware that graphical browsers do the same, but it’s more convenient in, say, Lynx.

I’ve found a couple of examples of Google’s apparently tracking my queries, though I do not suggest that Google knows who “I” am or that there is any conscious targeting. As I wrote previously:

I’ve seen results listings like these on Google twice now:

Linkname: Residence Life - Virtual Reality Project

URL: http://www.google.com/url?sa=U&start=3&q=http://www.life.arizona.edu/vr/detailed.asp&e=42

Of interest are the interposed characters http://www.google.com/url?sa=U&start=3&q=. The cursor was sitting on the link whose URL should simply have been http://www.life.arizona.edu/vr/detailed.asp&e=42.
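
For anyone who wants to pull such a link apart, here is a minimal sketch using Python’s standard `urllib.parse` module, with the URL written without internal spaces. It assumes nothing about what Google does with the `sa`, `start`, or `e` parameters; it only shows how the wrapper and the destination separate.

```python
from urllib.parse import urlsplit, parse_qs

wrapped = ("http://www.google.com/url?sa=U&start=3"
           "&q=http://www.life.arizona.edu/vr/detailed.asp&e=42")

parts = urlsplit(wrapped)
params = parse_qs(parts.query)

print(parts.netloc + parts.path)  # the intermediary the click passes through
print(params["q"][0])             # the destination carried in the q parameter
print(sorted(params))             # every parameter name in the wrapper
```

Under standard query-string rules, the trailing `e=42` is read as a parameter of the Google URL rather than as part of the destination. Whether it was meant for Google or for the destination site is not something the parse alone can settle.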

On this and the other occasion, neither of which I was able to reproduce (not even by re-feeding the same search queries in a new session), Google seems to be tracking random search requests.

I have not witnessed this phenomenon any more times since first writing about it.
