An end to my referrer abuse

Amen. I’ve always found it irritating that news aggregators insert their URL into the referrer field. ... It would be nice if there was some sort of browser header the aggregator could send to identify itself instead of using the referrer field. Oh, that’s right, there is. It’s called User-Agent.

The user agent field is designed for browsers, robots, and other user agents to identify themselves to the Web server. You can even add additional information, like a contact URL or email address. I’d like to see aggregators start using it.

Hmm, being mostly a standards neophyte, I thought this was a great idea, you know, NeatLikeDigitalWatches. I thought this was more a semi-clever overloading of the referer, rather than outright abuse. And this, I thought, was reasonably okay since there wasn't, I thought, anywhere else to stick a backlink to myself while consuming RSS feeds.

Well, yeah, now that I read some of the complaints against this use of referers, I agree. And, yes, now that I read the fine RFC, I see that the User-Agent string is more appropriate for this purpose.

So! From now on, hits from my copy of AmphetaDesk will leave behind a User-Agent string similar to this:

"AmphetaDesk/0.93 (darwin; http://www.disobey.com/amphetadesk/; http://www.decafbad.com/thanks-for-feeding-me.phtml)"

I tack my own personal thanks URL onto the end of the list within the parentheses. In addition, I no longer send a referrer string when I download RSS feeds. How did I do it? Very simply.

First, I modified my AmphetaDesk/data/mySettings.xml file by hand to supply a blank referer and a new user URL:

<user>
    ...
    <http_referer></http_referer>
    <user_url>http://www.decafbad.com/thanks-for-feeding-me.phtml</user_url>
    ...
</user>

Second, I modified AmphetaDesk/lib/AmphetaDesk/Settings.pm to account for the new setting:

...
$SETTINGS{user_http_referer} = "http://www.disobey.com/amphetadesk/";
$SETTINGS{user_user_url} = "http://www.disobey.com/amphetadesk/";    # default for the new user_url setting
$SETTINGS{user_link_target} = "_blank";
...

Third, I modified the create_ua() subroutine in AmphetaDesk/lib/AmphetaDesk/WWW.pm to actually use the new setting:

sub create_ua {
...
    my $ua = LWP::UserAgent->new;
    $ua->env_proxy();
    $ua->timeout(get_setting("user_request_timeout"));

    # app version, app URL, OS name, and the new per-user URL
    my ($app_v, $app_u, $app_o, $user_u) = (get_setting("app_version"),
            get_setting("app_url"), get_setting("app_os"), get_setting("user_user_url"));
    $ua->agent("AmphetaDesk/$app_v ($app_o; $app_u; $user_u)");
...
}
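One wrinkle with plain string interpolation: if the user URL setting is left blank, the comment ends in a dangling "; ". Here is a minimal sketch of one way around that, assuming a hypothetical helper named build_agent_string (it is not part of AmphetaDesk) that joins only the non-empty pieces. It uses nothing beyond core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of AmphetaDesk: builds the User-Agent
# string and skips any comment piece that is undefined or empty.
sub build_agent_string {
    my ($version, $os, $app_url, $user_url) = @_;

    # Keep only the pieces that actually have a value.
    my @comment = grep { defined($_) && length($_) } ($os, $app_url, $user_url);

    # With nothing to put in the comment, send the bare product token.
    return "AmphetaDesk/$version" unless @comment;

    return "AmphetaDesk/$version (" . join("; ", @comment) . ")";
}

print build_agent_string(
    "0.93", "darwin",
    "http://www.disobey.com/amphetadesk/",
    "http://www.decafbad.com/thanks-for-feeding-me.phtml",
), "\n";
```

The result could be handed straight to $ua->agent(...) in place of the interpolated string.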

And voila - no more referer abuse. If you want to discover my thank-you message, examine the User-Agent string. Seems like this would be a good idea for all news aggregators to pick up. And if I get ambitious and have spare time, I'll be sending off a patch to Morbus & friends later today.
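For anyone on the receiving end, here is a rough sketch of how a feed producer might dig those URLs back out of a User-Agent string. The helper name urls_from_user_agent is made up for illustration, and it assumes the aggregator follows the semicolon-separated comment layout shown above, which other aggregators are under no obligation to do.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the URLs found inside the first
# parenthesized comment of a User-Agent string, in order.
sub urls_from_user_agent {
    my ($agent) = @_;
    return () unless defined $agent;

    # Grab the text between the first "(" and the next ")".
    my ($comment) = $agent =~ /\(([^)]*)\)/ or return ();

    # Split on semicolons and keep only the pieces that look like URLs.
    return grep { m{^https?://}i } split /\s*;\s*/, $comment;
}

my @urls = urls_from_user_agent(
    "AmphetaDesk/0.93 (darwin; http://www.disobey.com/amphetadesk/; "
  . "http://www.decafbad.com/thanks-for-feeding-me.phtml)"
);

# By the convention in this post, the last URL is the reader's own.
print "$urls[-1]\n" if @urls;
```

In a CGI script the raw header would come from $ENV{HTTP_USER_AGENT}; in a log analyzer it would come from whatever field your log format puts it in.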

Update: Gagh! This has been the hardest post to try to format correctly within the fancy schmancy auto-formatting widgets I have piped together. All apologies for content resembling garbage. I think I'll use this excuse in the future whenever I write something completely daft. (Which means I'll be using it a lot, most likely.)


Archived Comments

  • Ok, I just made the change, keep an eye on your logs...
  • News aggregators *could* provide an appropriate referrer, possibly one of: permalink of the item containing the followed link or (when a permalink is not available) the RSS feed containing the followed link. Isn't this how the referrer is supposed to be used?
  • Why is it better to abuse the User-Agent header than the Referer header? According to the RFC link http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.43 the user agent request header "contains information about the user agent originating the request", specifically "[t]he field can contain multiple product tokens [...] and comments identifying the agent and any subproducts which form a significant part of the user agent." The url "http://www.decafbad.com/thanks-for-feeding-me.phtml" doesn't really identify a product (the aggregator itself) or a subproduct (presumably libraries used by the aggregator, like the Python/Java runtime, the .NET Framework, Perl etc.) My reading of the RFC might be incorrect, but I don't see anything about what was originally the whole point of this exercise - to provide feedback (no pun intended) to feed producers about who was reading them etc. - by providing links back to a specific user's weblog (or any kind of "trail" back to any other url-identifiable resource for that matter.) In theory, I guess you could say that your subscriptions file ("myChannels.opml" or whatever it is called in your aggregator) is a part/subproduct of your aggregator since it provides the configuration settings about which feeds to download. So you could probably rename your configuration file to something that uniquely referred to your weblog ("www.decafbad.com-daily-reading.opml") and add that to your User-Agent header after the standard AmphetaDesk header bits. A naming convention (like the old "userWeblog=...") that could be used across different weblogs would help feed producers *if* they had access to User-Agent header information.
However, if the only objections against the current practice of overloading the referer header are that 1) user-agent information shouldn't be in the referer header, and 2) that "[t]he Referer field MUST NOT be sent if the Request-URI was obtained from a source that does not have its own URI", then we could just publish our subscription files on the web (so they have their own URIs) under a descriptive name ("http://www.decafbad.com/feeds-i-subscribe-to.opml") and use that as the (now legal) referer header? This would help feed producers who have access to Referer reports. I'm sure I'm missing something. Like the ultra-precisely defined header that identifies individual webloggers and/or their weblogs in a flexible, decentralized fashion. Seems like a good thing to have for a more two-way web and all that...
  • If you are a paranoid sysadmin type like me, you have a file of 1500 or so User Agents and rotate them, along with the referrer, et al, for every single Web transaction. It makes this type of concern meaningless, or at least misguided. But then, that assumes you want privacy. Hmm. Fame is expensive.
  • Use the %ENV, Luke. It's not guaranteed to work, every time, but it's worth a try. It is unclear to me where %ENV intersects with the OS X property list. It can't be that hard to suss, though. my $user_u = get_setting("user_user_url") || $ENV{WWW_HOME} || &generate_random_uri();
  • Although not a horrible method, I don't like it any better than messing up the referrer. See, the problem with this one is that I'd have a bazillion different user agent strings floating around my statistics - right now, I see amphetadesk, aggie, etc... with your method, there would be tons of different amphetadesk, aggie, etc strings in the user agent field. bah.
  • Eric: I think I'm with you - I'm not sure I'm completely in love with this change, but figured I'd try it and see reactions since it was such a small tweak. I'm in the midst of putting together a web app reporting framework for a project at work, and divining any usable information out of User-Agents is an occult art at best. Hell, some I've seen have even been including small HTML pages in their User-Agent strings. Gagh.