update shp blog post, hopefully 4 final time

yosh 2024-04-03 01:20:09 -04:00
parent 2d3bcf0bd4
commit 670df0a0cd
1 changed file with 80 additions and 16 deletions


---
title = shell: hypertext processor
ctime = 2023-11-20T23:40:00Z
mtime = 2024-04-03T06:08:23Z
uuid = 14f1934b-76d1-4dd4-ae1c-65bf0e305eea
TB_TITLE = xterm - /bin/sh build.sh
---
that name might ring a few alarm bells to anyone with a sense of morality. alas, necessity is the mother of invention, and I have willed this into existence
in normal speak, I have created probably the most cursed static site generator ever by combining the inlining of php with the language of shell. I have spent so much time refining this to a state where it's basically frictionless for me to use that I have looped back to being proud of it (I think this is called stockholm syndrome?)
## but why
this site you're reading right now is hosted on [unix.dog](https://unix.dog), a pubnix that doesn't really want users hosting *dynamic* pages. when I got a unix.dog account, there was no templating software such as php, hugo, etc. due to this, I was [making most of my pages manually](https://git.unix.dog/yosh/website/src/commit/62a41d9c17460dd1f445063f4f9aec8200891c52) to start off, which I knew very well wasn't going to scale. php did actually get added to the pubnix at some point between then and my "first draft" for something like this, but I don't remember when. either way, I *thought* there wasn't any templating software still. I had a few options: create my own using resources already on the system, ask [citali](https://unix.dog/~nullobsi) to add something new, or find/compile one
I chose to make my own. I'd already had a bunch of experience writing shell, so I wanted to take a stab at another "big" thing in it, akin to [flacconv](https://git.unix.dog/yosh/flacconv)
[the first draft](https://git.unix.dog/yosh/website/src/commit/68a17eb2afce3eeeca336a7d30d80b75d586fd5c/build.sh) fucking *sucked* in retrospect. how it worked was that you'd make a single-line comment of the form `<!--RES f=file;VAR=val;VAR2=val2 -->` to bring in a template from the `BUILD_RESOURCES` folder, replacing any strings matching `var-VAR` with `val`, `var-VAR2` with `val2`, etc.
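to give a rough idea of the mechanism, here's a hypothetical sketch (not the actual first draft--`expand_res` and its internals are invented here): each directive basically turns into a pile of sed substitutions run over the template
```
# hypothetical sketch of the first-draft idea, not the real script:
# expand one "f=tmpl;VAR=val;VAR2=val2" spec by running the template
# through one sed substitution per VAR=val pair
expand_res() {
    spec=$1 tmpl=''
    set --                                # collect sed -e arguments here
    oldifs=$IFS; IFS=';'
    for field in $spec; do
        case $field in
        f=*) tmpl=${field#f=} ;;          # which template to pull in
        *=*) set -- "$@" -e "s/var-${field%%=*}/${field#*=}/g" ;;
        esac
    done
    IFS=$oldifs
    # assumes at least one VAR=val pair, otherwise sed gets no script
    sed "$@" "BUILD_RESOURCES/$tmpl"
}
# usage: expand_res 'f=header;TITLE=home'
```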
it was also slow as shit. I can't remember the exact specifics, but it'd take almost 2 seconds to go through my (very small at the time, mind you) site on my hardware, let alone how long it took on unix.dog's hardware (I want to say around 5-6 seconds?). all those grep calls weren't really helping. clearly, I needed something better
again, I still thought no templating software was available, so I set out to make a *different* way to update my site with a shell script
from the get-go, I had the idea of making a php-like templating method, but with shell: `<!--sh <shell> -->` comments, located with `grep` and handled with `awk`
of course, this was slow as shit, as well as unreliable. it'd take over a second to build the site on my machine--which, keep in mind, *still wasn't using markdown for anything*. this [was optimized](https://git.unix.dog/yosh/website/src/commit/ab5c46d14f0a46dd39be0b4432a1896d826bffa4/build.sh) to just about a second by doing a single run of `grep` for each file, storing the counted lines, then iterating over that. even still, the unreliability persisted, because in the event that the shell block outputted `<!--sh <stuff> -->`, the awk would fuck up and flag it. of course, any reasonable person *wouldn't* do this, but reasonable people *would* put `-->` for arrows in their webpage, and handling that proved to be a much harder task in awk than one would think. another refactor had to be done
[by this point](https://git.unix.dog/yosh/website/src/commit/9d7a7fd02cf4709c2f4ee456726b9948f39737bb/build.sh), I dropped the idea of using normal html comments and went full-on with `<?shp <shell> ?>` syntax (well, `<?sh` here, but that'll change in the next commit). this change alone would have let the old implementation handle `-->` properly and generally be nicer to parse, but by now, instead of the super slow method of `grep` + `awk` for every single file, I made a `while read -r` loop for the processing function, which fed into a `case` glob parser for each line, setting a flag depending on whether the line contained `<?shp`. if the flag was null, echo the line. if it was just set, echo everything before the `<?shp` and set it. if it was active (including just set), `eval` the line. if the line contained `?>`, print everything after it and unset the shell flag. this significantly sped up the process: the script no longer iterated over the entire file multiple times--only once now--and it no longer relied on external calls for each file, which incur quite a bit of overhead!
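condensed down, the parser looked something like this (a sketch reconstructed from the description above, not the verbatim script):
```
# a condensed sketch of the one-pass parser, reading stdin
__process_shp() {
    __SH_FLAG=
    while IFS= read -r __LINE; do
        case $__LINE in *'<?shp'*)
            printf '%s' "${__LINE%%<\?shp*}"   # html before the tag
            __LINE=${__LINE#*<\?shp}
            __SH_FLAG=1 ;;
        esac
        if [ -z "$__SH_FLAG" ]; then
            printf '%s\n' "$__LINE"            # plain html passes through
            continue
        fi
        case $__LINE in
        *'?>'*)
            eval "${__LINE%%\?>*}"             # last bit of the block
            printf '%s\n' "${__LINE#*\?>}"     # html after the tag
            __SH_FLAG= ;;
        *)  eval "$__LINE" ;;                  # still inside the block
        esac
    done
}
```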
by this point, it was looking pretty good! my site was building much faster now--about 0.3 seconds on my machine. for this being in shell, that's pretty good! this wasn't without its own set of issues, though. for one, calling `eval` separately for each line was slow and behaved weirdly: if I needed multiline commands, such as a here-document or simply wrapping a long command across multiple lines, it'd break. the major issue, though, was that I didn't have a proper way of "including" other files. how I worked around this was by [directly calling other shell scripts](https://git.unix.dog/yosh/website/src/commit/9d7a7fd02cf4709c2f4ee456726b9948f39737bb/src/index.html), where the global working directory would always be the `include` directory. this was a bit clunky, and something that nagged at me. sure, it *worked*, but it wasn't really "including"--I was just wrangling shell scripts together without a proper "include" function. as such, I went back to the drawing board
the major thing I noticed with attempting to make a proper `include` function was shell variables. if you didn't know, variables in shell are *always* global unless set in a subshell/subprocess. this is actually why I made the variables that are used for the build script all caps and prefixed with two underscores--something to clearly indicate that these variables are only meant to be used by the build script, and you probably shouldn't be using variables that follow the same convention. anyway, I realized that making an include function that worked like php, where variables persist, would mean processing that file in the same shell environment. however, doing so would override crucial variables such as `__LINE` and `__SH_FLAG`, causing who knows what to break.
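a tiny demonstration of the footgun, if you've never been bitten by it:
```
greet() { name=bob; }   # no "local" in POSIX sh
name=alice
greet
echo "$name"            # prints "bob": the function clobbered the caller's variable
( name=carol )          # a subshell gets a throwaway copy...
echo "$name"            # ...so this still prints "bob"
```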
realizing this led me to the most cursed part of [the next iteration of the script](https://git.unix.dog/yosh/website/src/commit/358459a67872a46e76569b4564faccce24623591/build.sh). I wrapped the *entire* file processing part in an `eval`, single-quoted so as not to double-evaluate any parts. whenever it came to the important variables like `__LINE`, `__SH_FLAG`, and `__LINE_EXEC` (a variable I made to work around the "`eval` every line" issue), I would briefly step out of the single quotes to append a number to the variable name, defined by the variable `__SHLVL`. this variable is incremented by 1 each time `__process_shp` runs, then decremented when the function ends. this makes includes use different variables for their own contexts without influencing the parent's special variables, while keeping the benefit of making included variables work like php.
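stripped way down, the trick has roughly this shape (a sketch, not the linked script--the real eval body does a lot more):
```
# sketch: the loop body is one big single-quoted string, but variable
# names inside it are stitched together with the *unquoted* $__SHLVL,
# so every recursion depth gets its own __LINE<n>, __SH_FLAG<n>, etc.
__process_shp() {
    __SHLVL=$((__SHLVL + 1))
    eval '
        while IFS= read -r __LINE'"$__SHLVL"'; do
            # ...parse/eval using __LINE'"$__SHLVL"' and
            # __SH_FLAG'"$__SHLVL"', so an include that calls
            # __process_shp again cannot clobber this level...
            printf "%s\n" "$__LINE'"$__SHLVL"'"
        done < "$1"
    '
    __SHLVL=$((__SHLVL - 1))
}
```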
by now, it was basically perfect. implementing markdown support was very simple--passing the markdown file through `lowdown` and sending that to the stdin of a special script in the include directory, `_markdown-template.sh`. as I was writing this, I changed this to be a proper `shp` file by making `__process_shp()` take an argument rather than consuming stdin for itself. personally, I never needed like, a bunch of html outside of `include` for markdown--just using shell's `echo` with commands was enough for me--but oh well! at least it's consistent now, and it doesn't incur a big speed detriment
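the whole markdown path then looks roughly like this (the source path here is made up; the template path is the real one from my repo):
```
# markdown pipeline, paraphrased: lowdown renders markdown to html on
# stdout, which becomes stdin for the template; the template itself is
# a regular shp file, processed by argument instead of eating stdin
lowdown "src/blog/shp.md" | __process_shp "include/_markdown-template.shp"
```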
## the result
as it stands, I feel as if I am completely done with this. the "library" gives you only a few basic functions--`include`, `set_md_metadata`, and `escape`--and the rest is up to you to figure out. this site currently builds in ~0.4 seconds on my machine--about a second on unix.dog. while slower than one of the previous iterations, it makes up for that with like 10x more functionality. it scales slightly worse than linearly, which is to be expected, but build times shouldn't get too bad. oh well!
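for a taste, a mock page might look like this (the `include` invocation here is hypothetical--check the repo for the real arguments):
```
<h1>cool page</h1>
<?shp
    # variables set here stick around for the rest of the page, php-style
    page_title="my cursed site"
    include _header.shp    # hypothetical invocation; see the repo
?>
<p>built on <?shp date -u +%Y-%m-%d ?></p>
```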
the only caveat I've seen remain in the final script is that you can't start a new `shp` block on the same line as another one ends. for the sake of keeping the parsing code simple, I am not changing this. any reasonable person wouldn't do that anyway.
## an update from the future
this post has become a bit of a "living post", so to say. I thought I was finished with the script--by all means, I went 4-ish months without touching it to any real degree--but recently I did a big refactor *again*, and now I want to document it. for reference, [this was the state of the script right before the next big refactor](https://git.unix.dog/yosh/website/src/commit/acb0cf43b54ba1d8a9b92fb141aeb2b4eb129256/build.sh). there are a few changes between it and the state when I first wrote this post, but I'll only focus on one in particular. this part's about the nitty gritty
to start, I realized *very quickly* that I wanted an RSS feed for my blog. additionally, there was a lot of duplicated html between stuff like [credits](/~yosh/credits.md) and blog posts. as such, I needed metadata for each markdown file on my site. my initial idea for this was leaving the markdown files as-is and creating a new file, `mdmeta`, inside the same directory. this actually worked quite well, and I was pleasantly surprised with the result... but it was *very* finicky. filenames had to match, and the friction between "writing" and "modifying metadata" was way too high. I got tired after a few months and set out for something better
initially, I wanted to keep markdown files pure and unencumbered, left for the text alone with metadata delegated elsewhere. however, once `mdmeta` stopped doing the trick, I realized that I basically *had* to put the metadata in the markdown files themselves. as such, I abused the idea of stdin to require only one pass through each file, changing `set_md_metadata` to work off a header like so:
```
--- # any amount of - in a row
variable = value
var = foo
--- # any amount of - in a row
```
then the rest of the stdin gets sent to [the markdown template](https://git.unix.dog/yosh/website/src/branch/master/include/_markdown-template.shp). it works surprisingly well for what it is
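the neat part is that this works because shell's `read` consumes stdin one byte at a time, so whatever the header parser doesn't eat is still sitting there for the next consumer. here's a simplified take on what `set_md_metadata` has to do (the real one is in the repo; the `md_` variable prefix is invented for this sketch):
```
# simplified sketch: eat the --- header off stdin, set one variable per
# "key = value" line, leave the rest of the stream for whoever's next
set_md_metadata() {
    IFS= read -r _line                     # the opening --- line
    while IFS= read -r _line; do
        case $_line in
        ---*) break ;;                     # closing delimiter
        *=*)
            _key=${_line%%=*} _val=${_line#*=}
            _key=${_key%"${_key##*[! ]}"}  # trim trailing spaces
            _val=${_val#"${_val%%[! ]*}"}  # trim leading spaces
            eval "md_$_key=\$_val"         # assumes sane key names
            ;;
        esac
    done
}
```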
it was at this point that I realized the *sheer magnitude* of the overhead that *simply calling external programs* incurs in some situations. when I was looking over the script, the way I had written `__process_shp` always nagged me. it's such an eyesore. surely there had to be a better way? I revisited my good friend awk and whipped up this script:
```
eval "$(awk -v RS='<\\?shp|\\?>' \
'{n=!n; gsub(/'\''/,"'\''\\'\'''\''")}
!n{print "eval '\''" $0 "'\''"}
n{print "printf %s '\''" $0 "'\''"}' "${1:--}"
)"
```
please mind the massive amount of escaped quotes. essentially, the awk has two states, represented by `n`. for each record--where the record separator is `<?shp` *or* `?>`--it first replaces all single quotes `'` with `'\''`, then prints either `printf %s '<text>'` or `eval '<text>'`. the quote replacement makes sure all quotes are escaped correctly. then, since that's all wrapped in one big `eval`, it either prints data or executes shell. this is quite clever! I felt good about making this! but it ran slower. about 3x slower, in fact. ugh. I even tried experimenting with using only shell but having it work on the entire block of data rather than a line at a time, but this ran into issues with making sure stuff was truncated right, and even then, it was slower than the while read loop. **what I have in the repo right now is the fastest method I have found**, which is crazy for me to think about
anyway, after realizing that simply calling external tools incurs overhead, I turned my head towards another bottleneck--`escape`. it was just a simple 5-statement sed call for escaping html, but because it was an external program, calling it incurred a surprising amount of overhead! I refactored it into pure shell specifically for the purposes of the build script... and it was *abysmally* slow at building my RSS feed's content field. I realized then that the pure-shell solution is *faster* for *small* input, but *slower* for large input. as such, I split the function in two: `escape` and `escapepipe`. the former only accepts input via its argument, which is suitable for small inputs, while the latter only accepts input through, well, a pipe, which is usually the case for larger inputs (notably `content` for rss feeds)
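the pair ended up shaped something like this (condensed sketches, not the repo's exact functions):
```
# condensed sketch of the split: pure shell wins on tiny strings
# (no fork), sed wins once the input gets big
escape() {                   # small input, passed as an argument
    _in=$1 _out=''
    while [ -n "$_in" ]; do
        _c=${_in%"${_in#?}"} # peel off the first character
        _in=${_in#?}
        case $_c in
        '&') _out="$_out&amp;" ;;
        '<') _out="$_out&lt;" ;;
        '>') _out="$_out&gt;" ;;
        '"') _out="$_out&quot;" ;;
        "'") _out="$_out&#39;" ;;
        *)   _out="$_out$_c" ;;
        esac
    done
    printf '%s' "$_out"
}
escapepipe() {               # big input, piped in
    sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'\''/\&#39;/g'
}
```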
the site was building quickly. 0.35 seconds on my machine! quite impressive for what it's worth. but of course, there's almost always room for improvement. in all my optimizations for single-threaded execution, I sorely forgot the most powerful optimization of them all: multithreading! well, less of an optimization and more of "squeezing as much as possible out of the shell". wanna know how to multithread natively in shell--no special `parallel` program or anything? here you go:
```
# make a parallel fifo/file descriptor we need
__fifo_para="$(mktemp -u -t "shpbuild.XXXXXX")"
mkfifo "$__fifo_para"
exec 9<>"$__fifo_para"
rm -f "$__fifo_para"
while [ "$__THREADS" -gt 0 ]; do
printf "\n" >&9 # start with THREADS amount of lines in fd 9 for later
__THREADS="$((__THREADS - 1))"
done
# read each line from fd 9, launch new program for each line
# print a line after program finished such that another one can take its place
__run_in_parallel() {
read -r __ <&9
{
"$@"
printf '\n' >&9
} &
}
```
a clever hack for sure, and one that came in quite handy. I set `__THREADS` to `nproc * 2`, which seems to be a good balance for speed both on my machine and unix.dog
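plugging the helper in is then as simple as this (a hypothetical loop--`__build_page` is made up, and the real build script's plumbing is more involved):
```
# hypothetical usage: at most __THREADS pages processed at once; each
# job takes a token (a line) from fd 9 and returns it when it finishes
for __f in src/*.shp; do
    __run_in_parallel __build_page "$__f"
done
wait    # drain the last batch before moving files into place
```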
## conclusion
with that, we have hit the end of the road. I don't know of any more optimizations to be had here. I've pulled out all the tricks I can--few external calls, avoiding big data wrangling in shell, *multithreading* for crying out loud--and now I believe I have spent them all. the results have proved fruitful:
```
# on my machine
real 0m0.065s
user 0m0.026s
sys 0m0.029s
# on unix.dog
real 0m0.224s
user 0m0.053s
sys 0m0.039s
```
pretty damn impressive for shell. the bottlenecks now lie mostly in the files themselves rather than the shell, which means I can finally rest easy. I'm not touching this anymore--it's perfect. an absolutely insane journey of squeezing as much as I can out of the shell for the sake of this site. at least I can move on to writing now
## where to find it
the script is [bundled with my website repo](https://git.unix.dog/yosh/website/src/branch/master/build.sh). if you want a good idea of how to use it effectively, poke around my site some--and take a look at those feeds especially. in any case, it's a shell script. I don't care enough to give it the whole 9 yards with a separate repo, license, etc. I don't think anyone other than me is even crazy enough to use it. be reasonable