THE FAR SIDE IN FRESHRSS

2022-11-23
|
A few years ago, I wanted to subscribe to The Far Side's "Daily Dose" via my
RSS reader. The Far Side doesn't have an RSS feed, so I implemented a
proxy/middleware to bridge the two.
|
It turns out that FreshRSS's XPath Scraping is almost enough to achieve
exactly what I want. The big problem is that the image server on The Far Side
website tries to prevent hotlinking by checking the Referer: header on
requests, so we need a proxy to spoof that. I threw together a quick PHP
program to act as a proxy (if you don't have this, you'll have to
click through to read each comic), then configured my FreshRSS feed as follows:
|
[Screenshot showing my FreshRSS XPath configuration]
|
* Feed URL: https://www.thefarside.com/
  The "Daily Dose" gets published to The Far Side's homepage each day.
* XPath for finding new items: //div[@class="card tfs-comic js-comic"]
  Finds each comic on the page. This is probably a little over-specific and
  brittle; I should probably switch to using the contains function at some
  point. I subsequently have to use the parent:: and ancestor:: axes, which is
  usually a sign that your screen-scraping is suboptimal, but in this case it's
  necessary because it's only at this deep level that we start seeing really
  specific classes.
* Item title: concat("Far Side #", parent::div/@data-id)
  The comics don't have titles ("The one with the cow"?), but they seem to have
  unique IDs in the data-id attribute of the parent <div>, so I'm using those as
  a reference.
* Item content: descendant::div[@class="card-body"]
  Within each item, the <div class="card-body"> contains the comic and its text.
  The comic itself can't be loaded this way for two reasons: (1) the <img
  src="..."> just points to a placeholder (the site uses JavaScript-powered
  lazy-loading, ugh - the actual source is in the data-src attribute), and (2)
  as mentioned above, there's anti-hotlink protection we need to work around.
* Item link: descendant::input[@data-copy-item]/@value
  Each comic does have a unique link which you can access by clicking the
  "share" button under it. This makes a hidden text <input> appear, which we can
  identify by the presence of the data-copy-item attribute. The value of this
  textbox is the sharing URL for the comic.
* Item thumbnail:
  concat("https://example.com/referer-faker.php?pw=YOUR-SECRET-PASSWORD-GOES-HERE&referer=https://www.thefarside.com/&url=",
  descendant::div[@class="tfs-comic__image"]/img/@data-src)
  Here's where I hook into my special proxy server, which spoofs the Referer:
  header to work around the anti-hotlinking code. If you wanted, you might be
  able to come up with an alternative solution using custom JavaScript loaded
  into your FreshRSS instance (there's a plugin for that!), perhaps to load an
  iframe of the sharing URL? Or you can host a copy of my proxy server yourself
  (you can't use mine: it's got a password, and that password isn't
  YOUR-SECRET-PASSWORD-GOES-HERE!).
* Item date: ancestor::div[@class="tfs-page__full tfs-page__full--md"]/descendant::h3
  There's nothing associating each comic with the date it appeared in the Daily
  Dose, so we have to ascend to the top level of the page to find the date
  from the heading.
* Item unique ID: parent::div/@data-id
  Giving FreshRSS a unique ID can help it stop showing duplicates. We use the
  unique ID we discovered earlier; this way, if the Daily Dose does a rerun of
  something it has already run since I subscribed, I won't be shown it again.
  Omit this if you want to see reruns.
|
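The class-matching and lazy-loading points above can be sketched outside FreshRSS. This is a rough illustration only: it uses Python's standard-library HTML parser rather than XPath, and the sample markup is hypothetical, pieced together from the class names and attributes mentioned in the list (the real page will differ). It matches on a single class token instead of the full class string, and reads data-src instead of src.

```python
from html.parser import HTMLParser

# Hypothetical markup, modelled on the class names and attributes described
# above; the real page may differ.
SAMPLE = """
<div data-id="1234">
  <div class="card tfs-comic js-comic">
    <div class="card-body">
      <div class="tfs-comic__image">
        <img src="/placeholder.gif" data-src="https://example.com/comic.jpg">
      </div>
    </div>
  </div>
</div>
"""


class ComicImageFinder(HTMLParser):
    """Collects the data-src of each <img> inside a div carrying the tfs-comic class."""

    def __init__(self):
        super().__init__()
        self.div_stack = []   # one bool per open <div>: is it a comic container?
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "div":
            # Match on a single class token rather than the full class string,
            # so extra or reordered classes do not break the match.
            tokens = (attributes.get("class") or "").split()
            self.div_stack.append("tfs-comic" in tokens)
        elif tag == "img" and any(self.div_stack):
            # The lazy-loaded image lives in data-src; src is only a placeholder.
            url = attributes.get("data-src")
            if url:
                self.image_urls.append(url)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()


finder = ComicImageFinder()
finder.feed(SAMPLE)
print(finder.image_urls)
```

One thing to watch if you do switch to contains in XPath: contains(@class, "tfs-comic") is a substring test, so it would also match tfs-comic__image. The usual XPath 1.0 idiom for a whole-token match is contains(concat(" ", normalize-space(@class), " "), " tfs-comic ").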
|
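The thumbnail expression above builds the proxy URL by plain string concatenation, and concat() does no escaping of its own. As a hedged sketch of what a more careful builder might look like, here is the same URL assembled in Python with each parameter percent-encoded. The endpoint and password are the placeholders from the configuration above, and the image URL is made up for illustration; whether the real image URLs ever contain characters that need encoding is something I haven't assumed either way.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Placeholder endpoint and password taken from the example configuration above.
PROXY_ENDPOINT = "https://example.com/referer-faker.php"


def proxied_image_url(image_url, password, referer="https://www.thefarside.com/"):
    """Return a proxy URL whose query string carries the password, referer and image URL."""
    query = urlencode({"pw": password, "referer": referer, "url": image_url})
    return f"{PROXY_ENDPOINT}?{query}"


# Hypothetical image URL, for illustration only.
example = proxied_image_url(
    "https://images.example.com/comic.jpg?size=large&v=2",
    "YOUR-SECRET-PASSWORD-GOES-HERE",
)
print(example)

# Round-trip check: the proxy would see the image URL intact, query string and all.
print(parse_qs(urlsplit(example).query)["url"][0])
```

With unencoded concatenation, an ampersand inside the image URL would be read as the start of a new parameter, so the proxy would receive a truncated url value.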
|
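For anyone curious what "spoofing the Referer: header" amounts to, the core of it is a single request header. The following is not my PHP proxy, just a minimal Python sketch of the same idea; the function names and the image URL are hypothetical. A real proxy would also need the access control mentioned above (mine uses a password) so that it can't be used to fetch arbitrary URLs.

```python
from urllib.request import Request, urlopen


def build_spoofed_request(image_url, referer="https://www.thefarside.com/"):
    """Build a GET request that presents `referer` as the page the image was linked from."""
    return Request(image_url, headers={"Referer": referer})


def fetch_image(image_url, referer="https://www.thefarside.com/", timeout=10):
    """Fetch the image bytes and content type using the spoofed Referer header."""
    with urlopen(build_spoofed_request(image_url, referer), timeout=timeout) as response:
        return response.read(), response.headers.get("Content-Type")


# Building the request needs no network access, so the header can be inspected offline.
request = build_spoofed_request("https://images.example.com/comic.jpg")
print(request.get_header("Referer"))
```

Calling fetch_image() with a real image URL would perform the actual download; the proxy then only has to relay the bytes and content type back to the RSS reader.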
|
There's a moral to this story: when you make your website deliberately hard to
consume, fewer people will access it in the way you want! The Far Side's
website is actively hostile to users (JavaScript lazy-loading, anti-right-click
scripts, hotlink protection, incorrect MIME types, no feeds, etc.), and
an inevitable consequence of that is that people like me will find and share
workarounds to that hostility.
|
If you're ad-supported or collect webstats and want to keep traffic "on your
site" on this side of 2004, you should make it as easy as possible for people
to subscribe to content. Consider The Oatmeal or Oglaf, for example, which
offer RSS feeds that include only a partial thumbnail of each comic and a link
through to the full thing. I don't feel the need to screen-scrape those sites
because they've given me a subscription option that works, and I routinely
click through to both of them to enjoy their latest content!
|
Conversely, The Far Side's aggressive anti-subscription technology ultimately
means that there are fewer actual visitors to their website... because folks
like me work to circumvent it.
|
And now you know how I did so.
|
Update: want the new content that's being published to The Far Side in
FreshRSS, too? I've got a recipe for that!
|
LINKS

* The Far Side (https://www.thefarside.com)
* My blog post: Subscribing to The Far Side via RSS
* Release tag for FreshRSS 1.20.0 (https://github.com)
* FreshRSS (https://freshrss.org)
* Pull request adding XPath scraping to FreshRSS (https://github.com)
* My initial blog post demonstrating how to use FreshRSS's XPath scraping features (https://danq.me)
* Beverley's website (https://webdevbev.co.uk)
* referer-faker.php, my PHP referer:-adding proxy (https://gist.github.com)
* MDN definition of the XPath contains function (https://developer.mozilla.org)
* CustomJS plugin for FreshRSS (https://github.com)
* The Oatmeal (https://theoatmeal.com)
* Oglaf (https://www.oglaf.com)
* I've got a recipe for that!