################################################################################

	A script for removing HTML tags
	- Willow Willis (2024-07-06)

###############################################################################

Just a dumb little bash script I wrote to help format a batch of articles from
my website in preparation for transferring them to gopherspace.

Features:
* Replaces heading tags with hash marks (<h1> to <h3>) or an asterisk (<h4>)
* Replaces common HTML entities (&rsquo;, &quot;, &amp;, ...) with plain ASCII
* Converts all titles to uppercase
* Optionally adds extra newlines for every <br> or <br />

I run it on a batch of .html docs at once:

    find /posts -name "*.html" -exec ./stripHTML {} \;
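
After a batch run it can be handy to confirm that every .html file got a .txt
sibling. A minimal sketch, assuming the script writes each .txt next to its
source file (as the source further down does); the function name missing_txt
is made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper (not part of stripHTML): print every .html file under
# the given directory that has no .txt file with the same base name.
missing_txt() {
    local dir="$1"
    find "$dir" -name "*.html" | while IFS= read -r html; do
        # ${html%.*} drops the extension, mirroring the script's own naming
        [ -f "${html%.*}.txt" ] || printf '%s\n' "$html"
    done
}
```

Calling `missing_txt /posts` after the find command should print nothing if
every article was converted.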

Of course, the output still needs a little extra hand-formatting for 
consistency, but this saved me a bunch of time regardless.

NOTE: this script does *not* call fold on the output, so the resulting .txt
files will be too wide. It's easy to add that, but I wanted to keep a backup
of each .txt file for my records before chopping them to 80 columns.
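
A minimal sketch of adding that fold step while still keeping the wide copy.
The helper name fold_with_backup and the .orig suffix are assumptions for
illustration, not part of the script:

```shell
#!/bin/bash
# Hypothetical helper (not part of stripHTML): keep the unwrapped file as a
# backup, then rewrite the original wrapped to 80 columns.
fold_with_backup() {
    local txt="$1"
    cp "$txt" "$txt.orig"                # untouched wide copy for the records
    fold -s -w 80 "$txt.orig" > "$txt"   # -s breaks at spaces, -w sets the width
}
```

Something like `for f in /posts/*.txt; do fold_with_backup "$f"; done` would
wrap a whole directory. Words longer than 80 characters (long URLs, mostly)
still get cut mid-word, so those lines may need a look by hand.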


################################################################################
### SOURCE: ###
--------------------------------------------------------------------------------
#!/bin/bash

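# Usage: ./stripHTML path/to/post.html
# Writes path/to/post.txt next to the original, which is left untouched.
filepath="$1"
dir="$(dirname "$filepath")"
filename="$(basename "$filepath")"

noext="${filename%.*}"
TXT="$dir/$noext.txt"

# Work on a copy so the .html file is never modified
cp "$filepath" "$TXT"

# NOTE: 'sed -i ""' is the BSD/macOS form. GNU sed wants plain 'sed -i'.
sed -i "" 's/<p>//g' "$TXT"
sed -i "" 's/<\/p>//g' "$TXT"
sed -i "" 's/<h1>/#### /g' "$TXT"
sed -i "" 's/<h2>/### /g' "$TXT"
sed -i "" 's/<h3>/## /g' "$TXT"
sed -i "" 's/<h4>/* /g' "$TXT"
sed -i "" 's/<\/h1>/ ####/g' "$TXT"
sed -i "" 's/<\/h2>/ ###/g' "$TXT"
sed -i "" 's/<\/h3>/ ##/g' "$TXT"
sed -i "" 's/<\/h4>/ */g' "$TXT"
# Uncomment to replace <br> and <br /> tags with newlines.
# [^>]* stops at the end of the tag instead of eating the rest of the line.
#perl -i -pe 's/<br[^>]*>/\n/g' "$TXT"
sed -i "" "s/&rsquo;/'/g" "$TXT"
sed -i "" "s/&lsquo;/'/g" "$TXT"
sed -i "" 's/&ldquo;/"/g' "$TXT"
sed -i "" 's/&rdquo;/"/g' "$TXT"
sed -i "" 's/&quot;/"/g' "$TXT"
sed -i "" 's/&rarr;/->/g' "$TXT"
sed -i "" 's/&ndash;/-/g' "$TXT"
sed -i "" 's/&eacute;/e/g' "$TXT"
# Decode &amp; last so text like &amp;quot; is only decoded once.
# A bare & in a sed replacement means "the whole match", so escape it.
sed -i "" 's/&amp;/\&/g' "$TXT"

#Remove any remaining html tags that we don't care about
sed -i "" 's/<[^>]*>//g' "$TXT"

#Capitalize all the titles that we just added
perl -i -pe 's/#(.+)#/#\U$1#/gi' "$TXT"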
--------------------------------------------------------------------------------

### LICENSE: ###
Released under MIT license. Do whatever you want with this.
Original URL: gopher://shemake.dev/0tech/posts/Strip_HTML.txt