I come from the countryside of Austria. There, the Advent calendar has been an important part of the time around Christmas ever since I was a child. Watching my son open the little doors with the same excitement I used to feel, and find something sweet inside, brings back fond memories. In my day, though, the calendars were not filled only with tasty things. They held all sorts of items: home-made calendars contained nuts, mandarins, small toys and more. And sometimes, when something not so tasty was behind a door, the chocolate on another day tasted all the better.
Why am I telling you this in a Node.js Advent Calendar entry? Because today there may be something a little bitter in the calendar. For one, the rest of this entry is written in English. I am sleep-deprived and have completely lost confidence in my Japanese. (I would be very happy if you can more or less understand what I have written here.) Also, today's topic is DAT. DAT is as low-level as HTTP, so it may take about two more years until it reaches beta. The DAT that is usable today does work, more or less, and can be used for some things, but it is a technology of the future. Until 8 o'clock this morning I was investigating [the DAT push problem], discussing it with everyone in English on https://gitter.im/datproject/discussions. While investigating, I noticed a few small issues… Those issues are related to today's article, so the article did not progress as planned. I am very sorry.
Thank you for reading today's calendar entry, even though it is short and in English.
Have a merry Christmas
and a Happy New Year!
Distributed Remote Processing
This article is a follow-up to two presentations I held around NodeFest 2018. I recommend first looking at the slides for Intro to DAT and the DAT Workshop.
In this article you can learn how to start a process that reads a DAT, processes its content and allows someone to download the processed content.
Motivation
(Why would we want to do this?)
There is a fundamental question whenever you want to process any data set: should I move the data to the processor, or should I move the processor to the data?
With DAT it is possible to prepare a data set on one computer and let another computer access parts of it. Safely and quickly.
Introduction to darp²
darp² is a tool I wrote that simplifies processing of any given DAT content based on its link. The straightforward way to use it is to install it as a command-line tool and set a bash script as the command to execute.
You can install it globally like this:
$ npm install --global darp2
Example:
$ npm install --global img-resize-cli
$ darp2 \
--input dc193d0c4b16d94a544955e462ae64ec955858ca2fd2a2a2602ff2fbe9158278 \
--cmd "
cp -r \$DARP2_TEMP_FOLDER/* \$DARP2_WORK_FOLDER;\
cd \$DARP2_WORK_FOLDER;\
rsize \"images/*.JPG\" \".\" --porcent 20"
Let's look at what is happening here:
- We install img-resize-cli, which allows resizing of images.
- We pass --input ..., a DAT link, to darp2.
- We pass --cmd to darp2, which is executed once the data is downloaded.
- darp2 passes two environment variables to the command (see the small sketch after this list):
  - DARP2_TEMP_FOLDER: the folder where all content of the DAT is downloaded to.
  - DARP2_WORK_FOLDER: the folder where the command should write all output data to.
- cp -r $DARP2_TEMP_FOLDER/* $DARP2_WORK_FOLDER copies everything from the download folder to the work folder.
- By default darp2 will execute the command in the download folder; with cd $DARP2_WORK_FOLDER we move the command into the work folder.
- img-resize-cli added the rsize command, which we use to resize all images to 20% of their size.
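If the command you pass is itself a Node.js script, it can pick those two folders up from its environment. A tiny illustrative sketch (the script name is hypothetical, not part of the example above):

// resize-env.js – hypothetical script you could run via --cmd "node resize-env.js"
// darp2 exposes both folders to the command as environment variables:
const tempFolder = process.env.DARP2_TEMP_FOLDER // downloaded DAT content
const workFolder = process.env.DARP2_WORK_FOLDER // where the output should go
console.log('reading from', tempFolder, 'writing to', workFolder)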
This example should give a decent impression of how smooth remote data processing can be. What is still a problem, though, is that we need to download all of the data to the processor before we can process it. In this example that is "just" 62 MB, but suppose it grows to a gigabyte or a terabyte? The data will probably not arrive next to the processor as quickly.
When you run this, you should notice that you get the link to the output DAT before the processing or the download has even started! This is because a DAT can have data added to it after its link is created!
Processing with Node.js
The reason it has to download all of the data is that darp2 doesn't know which files are needed by the --cmd that is passed to it. If it knew, we could download only the necessary parts!
Keen eyes may notice that there is the darp2 --glob option to select which files to download. This may help in some cases, but we want to go about it in a more fine-grained fashion.
How the darp2 --cmd works
When you pass --cmd to darp2, it creates a lightweight wrapper around that command and passes the result to the --run-script option!
The run-script expects an object with three methods:
interface Entry {
  location: string
  stats: Stat
}

module.exports = {
  processFolder?: (tempFolder: string, workFolder: string, entry: Entry, inArchive: Hyperdrive) => Promise<any> | any,
  processFile?: (tempFolder: string, workFolder: string, entry: Entry, inArchive: Hyperdrive) => PromiseLike<any> | any,
  finish?: (inArchive: Hyperdrive, outArchive: Hyperdrive) => Promise<any> | any
}
- processFolder is called for every folder in the DAT.
- processFile is called for every file in the DAT.
- finish is called after every file & folder is finished.
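To make that shape concrete, here is a minimal run-script skeleton (a sketch only; the hook bodies just log what they receive):

module.exports = {
  processFolder (tempFolder, workFolder, entry, inArchive) {
    // called once for every folder in the input DAT
    console.log('folder:', entry.location)
  },
  processFile (tempFolder, workFolder, entry, inArchive) {
    // called once for every file in the input DAT
    console.log('file:', entry.location)
  },
  finish (inArchive, outArchive) {
    // called once after every file & folder has been handled
    console.log('all entries processed')
  }
}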
You might notice that the archives are not DAT instances but Hyperdrive instances.
hyperdrive is a subsystem of DAT that abstracts file-system access without exposing networking or other parts: you can access a hyperdrive almost like fs, but when you request data it will not be loaded from your hard drive but from connected peers!
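As a rough sketch of what that fs-like access looks like (hyperdrive's classic callback API; the key and file path are placeholders, and the peer connections themselves are set up elsewhere by DAT's networking layer):

const hyperdrive = require('hyperdrive')

// open an archive by its key; './storage' only holds metadata and the blocks we request
const archive = hyperdrive('./storage', '<dat-key>')

archive.readFile('/index.html', (err, data) => {
  if (err) throw err
  // the blocks for this file are fetched from connected peers on demand,
  // not read from a full local copy of the archive
  console.log(data.toString())
})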
Now, when a --cmd command is used, the wrapper simply downloads every file in processFile into the tempFolder.
Once all files & folders are finished, finish will add the files of the work folder to the outArchive.
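In other words, the built-in --cmd mode behaves roughly like the following run-script (my reading of the behaviour described above, not darp2's actual source; copyToFs is the same helper used in the script below):

const { copyToFs } = require('darp2')

module.exports = {
  processFile (tempFolder, workFolder, entry, inArchive) {
    // download every single file of the input DAT into the temp folder
    return copyToFs(inArchive, tempFolder, entry.location)
  }
  // finish then runs the --cmd and adds the contents of the work folder to the outArchive
}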
Writing our own script
const { minify } = require('html-minifier')
const { copyToFs, readFile, writeFileToFs } = require('darp2')

module.exports = {
  async processFile (tempFolder, workFolder, entry, inArchive, outArchive, log) {
    if (/\.html$/i.test(entry.location)) {
      const html = await readFile(inArchive, entry.location)
      const opts = {
        collapseWhitespace: true,
        removeAttributeQuotes: true,
        removeOptionalTags: true
      }
      log({ minHtml: entry.location, opts })
      const minHtml = minify(html.toString(), opts)
      // Let's store the html files at the same location they were at.
      return writeFileToFs(workFolder, entry.location, minHtml)
    }
    if (/\.(css|js|woff|otf|jpe?g|gif|png)$/i.test(entry.location)) {
      log({ copy: entry.location })
      return copyToFs(inArchive, workFolder, entry.location) // Just copy for now
    }
    log({ skip: entry.location })
    // Skip other files
  }
}
Let's quickly go through the script:
- It only has a processFile operation; processFolder and finish are not necessary for our work.
- It checks the entry location to see whether we have an HTML file and compresses it using html-minifier.
- It logs the options; this way we can see in the output's .dat-through.log what options we used at that time. If we wanted to be thorough, we could also include the package and Node.js versions.
- It checks whether non-HTML files are css, js, ... files and copies those to the target archive.
- It skips all other files. Those will not be downloaded!
Executing the script
This is a simple example of what your own darp² script could look like. You can pass it to the --run-script option and it will be executed.
$ darp2 \
--run-script index.js \
--input dc193d0c4b16d94a544955e462ae64ec955858ca2fd2a2a2602ff2fbe9158278
This will run the script against every file in the input DAT. The output link can be used as soon as it appears in the command line.
The .dat-through.json file contains a comprehensive log of what happened (including the log statements in the script).
You can immediately view the result with Beaker Browser. Try to have a look at "View Source".
Final thoughts
darp² opens the door to processing any data remotely and working with the result immediately.
While the implementation of darp² is still crude, it has a lot of potential. For example, it could come with an HTTP server that allows triggering the remote processing in a Docker container; then you could start remote processors at any cloud host within 30 seconds.
The input DAT link and the output DAT link are equally usable, so it is possible to pipe any number of operations through this remote system. With a bit of tuning of the command-line logic, the piping could actually become a lot quicker as well.
I hope this brought you a bit closer to darp² and the power of distributed remote processing.