Updated 2003.03.22
Matt Webb, remarking on Google’s purchase of Pyra, hence of Blogger:
Metadata-for-free is an important one. Imagine Google noticing that the HTTP_REFERER is a particular article. If they haven’t got that in Google News, it needs scraping. If there are a lot of referrers, that’s important too.
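Webb’s idea fits in a few lines. What follows is only a sketch, not anything Google is known to run: it assumes a list of referrer strings already pulled from request headers and a hypothetical set `indexed` of URLs the crawler already knows about. It counts the referrers and reports the ones missing from the index, busiest first.

```python
from collections import Counter

def unindexed_referrers(referrers, indexed):
    """Return (url, hits) pairs for referrers missing from the index, busiest first.

    `referrers` and `indexed` are hypothetical inputs for illustration only.
    """
    counts = Counter(r for r in referrers if r)  # ignore empty referrer headers
    missing = [(url, n) for url, n in counts.items() if url not in indexed]
    # Sort by descending hit count, then by URL so ties come out in a stable order
    return sorted(missing, key=lambda pair: (-pair[1], pair[0]))
```

A referrer that shows up often and is not yet indexed lands at the top of the list, which is the “needs scraping” case Webb describes.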
Here is part of my experience with Google, which I am having trouble understanding.
For a few weeks, I’ve been keeping track of Google searches that come up blank. (Why so little time? If I don’t write this now, I’ll never get around to it. How do I keep track? By mailing myself the results in Lynx, an incalculably convenient technique.) Today I re-checked those links, and they all still come up blank:
baka.k2r.net/sconv/
mtg.client.shareholder.com/downloads/Q42002stat.pdf
home.c2i.net/st_hall/eudora/
www.bbc.co.uk/commissioning/bbci/pdf/BBCi_Accessibility_Study_7-10-02.pdf
www.techdis.ac.uk/seven/papers/dyslexia.doc
www.actra.ca/actra/images/03march/CCAU.EXEC.SUMM.pdf
& CCAUCrisis.pdf
Naturally, my including them on this page may be enough to get them indexed.
This one’s easy. Google should fetch any URL that comes up with no search results.
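As a rough illustration of that suggestion, here is a minimal sketch of a queue that collects URL-shaped queries with zero results so a crawler could fetch them later. Everything in it is an assumption made for the example: the class name `NoShowQueue`, the helper `looks_like_url`, and the crude heuristic for deciding what counts as a URL. None of it reflects how Google actually works.

```python
from collections import deque

def looks_like_url(query):
    """Crude heuristic: no whitespace and a dotted host name before the first slash."""
    if not query or any(ch.isspace() for ch in query):
        return False
    host = query.split("://", 1)[-1].split("/", 1)[0]
    return "." in host.strip(".")

class NoShowQueue:
    """Collect URL-shaped queries that returned no results (hypothetical design)."""

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    def record(self, query, result_count):
        """Queue the query for fetching if it looks like a URL and found nothing."""
        if result_count > 0 or not looks_like_url(query):
            return False
        url = query if "://" in query else "http://" + query
        if url in self._seen:
            return False  # already queued once; do not fetch twice
        self._seen.add(url)
        self._pending.append(url)
        return True

    def next_url(self):
        """Hand the crawler the oldest queued URL, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None
```

Feeding it `baka.k2r.net/sconv/` with a result count of zero queues `http://baka.k2r.net/sconv/` exactly once, however many times the search is repeated.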
Google indexes PDFs and many other document types. In certain cases, Google results let you read the contents of the original file transcoded into HTML. But if you’re using a text-only browser, Google will also give you a link to a text-only transcoding of a PDF.
A PDF author can disallow indexing, “content extraction,” or reflow. Doing so also breaks accessibility support – in 40-bit-encryption mode, at least. (It says so right there in the dialogue box: “No Content Copying or Extraction, Disable Accessibility,” which is perhaps not the best turn of phrase. 128-bit encryption is no problem: “Enable Content Access for the Visually Impaired” and “Allow Content Copying and Extraction” are separate switches turned on by default.)
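The difference between the two encryption modes comes down to permission bits in the PDF’s encryption dictionary. The sketch below assumes the standard security handler as documented in the PDF Reference, where 40-bit encryption corresponds to revision 2 and 128-bit encryption to revision 3, and it assumes the integer permission value (the `/P` entry) has already been read by some other tool. The function name `can_extract` is made up for the example.

```python
COPY_EXTRACT = 1 << 4    # bit 5 (counting from 1): copy or extract text and graphics
ACCESSIBILITY = 1 << 9   # bit 10 (counting from 1): extraction for accessibility, revision 3 and later

def can_extract(p, revision):
    """Return (general_extraction, accessibility_extraction) for a /P flags value."""
    general = bool(p & COPY_EXTRACT)
    if revision <= 2:
        # 40-bit mode: one bit governs both, so blocking copying also blocks screen readers
        return general, general
    # 128-bit mode: accessibility has its own switch, independent of copying
    return general, bool(p & ACCESSIBILITY)
```

With revision 2, clearing the copy bit yields `(False, False)`. With revision 3, copying can be off while accessibility stays on, which matches the two separate checkboxes in the dialogue box.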
If you use Google to search for a URL of a PDF document with content extraction turned off, a graphical browser will return a result like this:
[PDF]Ministerial Inquiry into Telecommunications - Final Report
File Format: PDF/Adobe Acrobat
Page 1. Ministerial Inquiry into Telecommunications Final Report 27 September 2000 Page 2. Ministerial Inquiry into Telecommunications ...
Google can show you the following information for this URL:
* Find web pages that are similar to www.teleinquiry.govt.nz/reports/final/final.pdf
* Find web pages that link to www.teleinquiry.govt.nz/reports/final/final.pdf
* Find web pages that contain the term "www.teleinquiry.govt.nz/reports/final/final.pdf"
In other words, Google gives you no option to view the PDF in HTML. Google honours the no-extract setting of the original.
If, however, you do the same search in a text-only browser, the results are as follows:
[PDF][9]Ministerial Inquiry into Telecommunications - Final Report
File Format: PDF/Adobe Acrobat
Your browser may not have a PDF reader available. Google recommends visiting our [10]text version of this document.
Page 1. Ministerial Inquiry into Telecommunications Final Report 27 September 2000 Page 2. Ministerial Inquiry into Telecommunications ...
Google can show you the following information for this URL:
* Find web pages that are [11]similar to www.teleinquiry.govt.nz/reports/final/final.pdf
* Find web pages that [12]link to www.teleinquiry.govt.nz/reports/final/final.pdf
* Find web pages that [13]contain the term "www.teleinquiry.govt.nz/reports/final/final.pdf"
Google attempts to give you a text-only version of the PDF. In fact, Google always does so for a text-only browser.
Lynx numbers your hyperlinks for you, which is quite the convenience. Hyperlink 10 above leads to http://www.google.com/webhp?hl=en, an alternate of the Google homepage. Lynx will not load that page, or at least it appears to reload the original search page.
The confusingly-named Links browser is another text-only device. Its results look the same, really:
[PDF][11]Ministerial Inquiry into Telecommunications - Final Report
File Format: PDF/Adobe Acrobat
Your browser may not have a PDF reader available. Google recommends visiting our [12]text version of this document.
Page 1. Ministerial Inquiry into Telecommunications Final Report 27 September 2000 Page 2. Ministerial Inquiry into Telecommunications ...
Google can show you the following information for this URL:
* Find web pages that are [13]similar to www.teleinquiry.govt.nz/reports/final/final.pdf
* Find web pages that [14]link to www.teleinquiry.govt.nz/reports/final/final.pdf
* Find web pages that [15]contain the term "www.teleinquiry.govt.nz/reports/final/final.pdf"
Links also numbers your hyperlinks for you. Hyperlink 12 above leads to http://www.google.com/ but redirects to http://www.google.ca/. (Here Google sends me to the Canadian site, which it never, ever does otherwise, something I am grateful for.)
Inexplicably, the “text version” link does not lead to the same place in Lynx as it does in Links.
W3M is orders of magnitude more obscure even than Links and does exactly the same thing as Links.
First of all, the option to view a PDF in text only is much more useful in a graphical browser than a text-only one. Whether the source is HTML or plain text, the only possible result in a text-only browser is plain text. In a graphical browser, it is often tedious to watch the browser attempt to render massively complicated transcoded HTML. The option for text-only viewing should be available to all browsers.
However, when a PDF author turns off content extraction, Google should not even show a link to a text-only version. There is no such version, and the implementation of that feature is itself broken.
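Those two recommendations reduce to a single rule, sketched here. This is the behaviour being argued for, not what Google does; the function name and the link labels are placeholders invented for the example.

```python
def proposed_links(extraction_allowed):
    """Links that should be offered beside a PDF result, whatever the browser.

    Note that the browser type is deliberately not a parameter: the proposal
    is that text-only and graphical browsers be treated alike.
    """
    if not extraction_allowed:
        # Honour the author's no-extract setting: no transcoded version of any kind
        return []
    return ["view as HTML", "text version"]
```

The only input is the extraction setting chosen by the PDF author, so a text-only browser would no longer be shown a link to a version that cannot exist.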
Yes, I read the Orwellian exposé of Google’s suspected practices. Of course they’re not telling us everything they’re doing with our searches and the information they learn from the cookies installed on our machines. Whether or not that information is used perniciously is another question, but they’re not telling us everything.
But there are further advantages to using text-only browsers, as it turns out. It is a snap to see the URL that a link leads you to. Just place the cursor on that link (in Lynx, type the number plus g, as 12g) and look at the status line. If the link is too long for the line, press = and you’ll see the whole thing on a separate screen. I am aware that graphical browsers do the same, but it’s more convenient in, say, Lynx.
I’ve found a couple of examples of Google’s apparently tracking my queries, though I do not suggest that Google knows who “I” am or that there is any conscious targeting. As I wrote previously:
I’ve seen results listings like these on Google twice now:
Linkname: Residence Life - Virtual Reality Project
URL: http://www.google.com/url?sa=U&start=3&q=http://www.life.arizona.edu/vr/detailed.asp&e=42
Of interest are the interposed characters http://www.google.com/url?sa=U&start=3&q=. The cursor was sitting on the link whose URL should simply have been http://www.life.arizona.edu/vr/detailed.asp&e=42.
On this and the other occasion, neither of which I was able to reproduce (not even by re-feeding the same search queries in a new session), Google seems to be tracking random search requests.
I have not witnessed this phenomenon any more times since first writing about it.
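For anyone who runs into one of these wrapped links, the destination can be pulled back out with ordinary query-string parsing. The sketch below uses only Python’s standard `urllib.parse`; the function name `unwrap_google_url` is invented for the example. One caveat: standard parsing reads `e=42` as a parameter of the wrapper, not of the destination, and nothing in the listing settles which it really is.

```python
from urllib.parse import urlsplit, parse_qs

def unwrap_google_url(link):
    """Return the destination carried in the q parameter of a google.com/url link.

    Any link that is not such a wrapper is returned unchanged.
    """
    parts = urlsplit(link)
    if parts.hostname in ("google.com", "www.google.com") and parts.path == "/url":
        targets = parse_qs(parts.query).get("q")
        if targets:
            return targets[0]
    return link

wrapped = ("http://www.google.com/url?sa=U&start=3"
           "&q=http://www.life.arizona.edu/vr/detailed.asp&e=42")
destination = unwrap_google_url(wrapped)
```

Here `destination` comes out as `http://www.life.arizona.edu/vr/detailed.asp`, with `sa`, `start` and `e` left behind as the wrapper’s own parameters.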
Comments? Questions? Concerns? E-mail Joe Clark.