git clone git://git.codemadness.org/webdump

webdump
-------

HTML to plain-text converter tool.

It reads HTML in UTF-8 from stdin and writes plain-text to stdout.


Build and install
-----------------

$ make
# make install


Dependencies
------------

- C compiler.
- libc + some BSDisms.


Usage
-----

Example:

	url='https://codemadness.org/sfeed.html'

	curl -s "$url" | webdump -r -b "$url" | less

	curl -s "$url" | webdump -8 -a -i -l -r -b "$url" | less -R

	curl -s "$url" | webdump -s 'main' -8 -a -i -l -r -b "$url" | less -R


Yes, all these option flags look ugly; a shell script wrapper could be used :)


Goals / scope
-------------

The main goal is to use it for converting HTML mails to plain-text and to
convert HTML content in RSS feeds to plain-text.

The tool will only convert HTML to stdout, similarly to links -dump or lynx
-dump but simpler and more secure.

- HTML and XHTML will be supported.
- There will be some workarounds and quirks for broken and legacy HTML code.
- It will be usable and secure for reading HTML from mails and RSS/Atom feeds.
- No remote resources which are part of the HTML will be downloaded:
  images, video, audio, etc. But these may be visible as a link reference.
- Data will be written to stdout. Intended for plain-text or a text terminal.
- No support for JavaScript, CSS, frame rendering or form processing.
- No HTTP or network protocol handling: HTML data is read from stdin.
- Listings for references and some options to extract them in a list that is
  usable for scripting. Some references are: link anchors, images, audio, video,
  HTML (i)frames, etc.


Features
--------

- Support for word-wrapping.
- A mode to enable basic markup: bold, underline, italic and blink ;)
- Indentation of headers, paragraphs, pre and list items.
- Basic support to query elements or hide them.
- Show link references.
- Show link references and resources such as img, video, audio, subtitles.
- Export link references and resources to a TAB-separated format.


Trade-offs
----------

All software has trade-offs.

webdump processes HTML in a single pass. It does not buffer the full DOM tree,
although due to the nature of HTML/XML some parts, like attributes, need to be
buffered.

Rendering tables in webdump is very limited. Twibright Links has really nice
table rendering. However, implementing a similar feature in the current design
of webdump would make the code much more complex.
Twibright Links processes a full DOM tree and processes the tables in multiple
passes (to measure the table cells) etc. Of course tables can also be nested,
or be used in (older) web pages that use HTML tables for layout.

These trade-offs and preferences are chosen for now. They may change in the
future. Fortunately there are the usual good suspects for HTML to plain-text
conversion (each with their own chosen trade-offs of course).

For example:

- Twibright Links
- lynx
- w3m


Examples
--------

To use webdump as an HTML to text filter, for example in the mutt mail client,
change in ~/.mailcap:

	text/html; webdump -i -l -r < %s; needsterminal; copiousoutput

In mutt you should then add:

	auto_view text/html


License
-------

ISC, see LICENSE file.


Author
------

Hiltjo Posthuma
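

Wrapper sketch
--------------

The Usage section notes that a shell script wrapper could hide the option
flags. A minimal sketch follows, assuming curl is available. The function name
webview is hypothetical and not part of webdump. The flags are copied verbatim
from the Usage examples, and nothing else is assumed about them.

```sh
#!/bin/sh
# webview (hypothetical name): fetch a URL with curl and convert it with
# webdump. The option flags are taken as-is from the README Usage examples.
# Paging is left to the caller.

webview() {
	url="$1"
	if [ -z "$url" ]; then
		echo "usage: webview url [selector]" >&2
		return 1
	fi
	if [ -n "$2" ]; then
		# optional second argument: selector handed to -s, as in the
		# README example that uses -s 'main'
		curl -s "$url" | webdump -s "$2" -8 -a -i -l -r -b "$url"
	else
		curl -s "$url" | webdump -8 -a -i -l -r -b "$url"
	fi
}
```

Source the file, then for example:

	webview 'https://codemadness.org/sfeed.html' main | less -R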