|
|
SUBSCRIBING TO FORWARD USING FRESHRSS'S XPATH SCRAPING |
|
|
|
|
|
|
|
2023-03-28 |
|
|
|
|
|
|
|
As I've mentioned before, I'm a fan of Tailsteak's Forward comic. I'm not a |
|
|
|
fan of the author's weird aversion to RSS, so I hacked a way around it, first by |
|
|
|
using an exploit in webcomic reader app Comic Chameleon (accidentally getting |
|
|
|
access to comics weeks in advance of their publication as a side-effect) and |
|
|
|
later by using my own tool RSSey. |
|
|
|
|
|
|
|
But now that I'm able to use my favourite feed reader FreshRSS to scrape websites |
|
|
|
directly - like I've done for The Far Side - I should switch to using this |
|
|
|
approach to subscribe to Forward, too: |
|
|
|
|
|
|
 |
Screenshot showing RSS feed items: recent Forward episodes including their numbers, titles, and publication dates. |
|
|
|
|
|
|
Here are the settings I came up with: |
|
|
|
* Feed URL: http://forwardcomic.com/list.php |
|
|
|
* Type of feed source: HTML + XPath (Web scraping) |
|
|
|
* XPath for finding news items: //a[starts-with(@href,'archive.php')] |
|
|
|
* Item title: . |
|
|
|
* Item link (URL): ./@href |
|
|
|
* Item date: ./following-sibling::text()[1] |
|
|
|
* Custom date/time format: - Y.m.d |
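
To make the mapping between those settings and the page a little more concrete, here's a rough Python sketch of the same extraction. It is not how FreshRSS works internally. It uses only the standard library, whose ElementTree doesn't support starts-with() or following-sibling, so the XPath rules are emulated by hand. The sample markup, the "example" query parameter and the episode titles are all invented for illustration, and resolving relative links against the feed URL is an assumption about what a feed reader would want.

```python
from datetime import datetime
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

FEED_URL = "http://forwardcomic.com/list.php"

# Hypothetical, simplified markup: NOT copied from the real page.
SAMPLE_HTML = """
<div>
  <a href="index.php">Latest comic</a>
  <a href="archive.php?example=1">Example episode one</a> - 2023.03.21<br/>
  <a href="archive.php?example=2">Example episode two</a> - 2023.03.28<br/>
</div>
"""

def scrape(markup, base_url):
    """Emulate the XPath settings above on well-formed markup."""
    root = ET.fromstring(markup)
    items = []
    for a in root.iter("a"):
        href = a.get("href", "")
        # Item selector: //a[starts-with(@href,'archive.php')]
        if not href.startswith("archive.php"):
            continue
        # Item date: the text immediately after the link, which
        # ElementTree exposes as the element's "tail".
        date_text = (a.tail or "").strip()
        items.append({
            # Item title: "." (the link's own text content)
            "title": "".join(a.itertext()).strip(),
            # Item link: ./@href, made absolute against the feed URL
            "link": urljoin(base_url, href),
            # "- Y.m.d" in PHP terms becomes "- %Y.%m.%d" for strptime
            "date": datetime.strptime(date_text, "- %Y.%m.%d").date(),
        })
    return items

items = scrape(SAMPLE_HTML, FEED_URL)
for item in items:
    print(item["date"], item["title"], item["link"])
```

Note that the index.php link is skipped by the selector, which mirrors the way the newest comic doesn't show up in the feed until it moves into the archive.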
|
|
|
|
|
|
 |
Annotated screenshot showing how each XPath directive maps to each part of the page. The item selector finds each hyperlink that begins with "archive.php" (notably missing the most-recent comic at any given time, which is found at index.php), and the date is found in the text node that immediately follows it, in a slightly-unusual variation on ISO 8601. |
|
|
|
|
|
|
I continue to love this "killer feature" of FreshRSS, but I'm beginning to see |
|
|
|
how it could go further - I wish I had the free time to contribute to its |
|
|
|
development! |
|
|
|
|
|
|
|
I'd love to see a mechanism for exporting/importing feed configurations like |
|
|
|
this so that I could share them more-easily, for example. I'd also be |
|
|
|
delighted if I could expand on my XPath rules to load pages referenced by the |
|
|
|
results and get data from them, too, e.g. so I could use an image found by |
|
|
|
XPath on the "item link" page as the thumbnail image! These are things RSSey |
|
|
|
could do for me, but FreshRSS can't... yet! |
|
|
|
|
|
|
|
LINKS |
|
|
|
|
|
|
 |
My blog post promoting Forward as it reached episode #100 (https://danq.me) |
|
|
 |
Tailsteak (http://tailsteak.com) |
|
|
 |
Forward (http://forwardcomic.com) |
|
|
 |
Tailsteak posts in his official forums perhaps at the moment that he first fell out of love with RSS? (https://www.leftoversoup.com) |
|
|
 |
My blog post about hacking Comic Chameleon (https://danq.me) |
|
|
 |
My RSSey code to turn Forward Comic into an RSS feed (https://github.com) |
|
 |
My blog post about using FreshRSS's XPath feature to subscribe to my friend Beverley's weblog (https://danq.me) |
|
|
 |
FreshRSS (https://freshrss.org) |
|
|
 |
My blog post about using FreshRSS's XPath scraping to subscribe to The Far Side |
|