Node.js Day 19

Distributed Remote Processing

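I come from the Austrian countryside. There, Advent calendars have been an important part of the Christmas season ever since I was a child. Watching my son open the doors with the same excitement I once had and find sweets inside makes me very nostalgic. In my day, however, the calendars did not contain only tasty things the way they do now. All sorts of things were in them: a handmade calendar held nuts, mandarin oranges, toys and more. When a door sometimes held something that did not taste so good, the chocolate on another day tasted even better.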

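Why am I explaining this in a Node.js Advent Calendar? Today's door may hold something a little bitter. First, the article is written in English. I am short on sleep, so I have lost all confidence in my Japanese. (I would be very happy if what I write here is mostly understandable.) Second, today's topic is DAT. DAT is as low-level a technology as http, so it may take about two more years until it reaches beta. The DAT you can use today more or less works. It can be used for some things, but it is a technology of the future. Until 8 a.m. this morning I was on https://gitter.im/datproject/discussions, discussing in English with everyone while investigating [the DAT PUSH problem]. While investigating, I noticed several small problems… Because those problems are related to today's article, I could not make proper progress on the article. I am very sorry.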

Thank you for reading today's calendar entry, even though it is short and in English.


Have a merry Christmas
and a Happy New Year!


Distributed Remote Processing

This article is a follow-up to two presentations I gave around NodeFest 2018. I recommend first looking at the slides of Intro to DAT and DAT Workshop.

In this article you can learn how to start a process that reads a DAT, processes its content and allows someone to download the processed content.


Motivation

(Why would we want to do this?)

There is a fundamental problem when you want to process any data set. Should I move the data to the processor or should I move the processor to the data?

With DAT it is possible to prepare a data set on one computer and let another computer access parts of it. Safely and quickly.


Introduction to darp²

darp² is a tool I wrote that simplifies processing the content of any given DAT based on its link. The most straightforward way to use it is to install it as a command-line tool and pass it a bash script as the command to execute.

You can install it globally like this:

$ npm install --global darp2

Example:

$ npm install --global img-resize-cli

$ darp2 \
    --input dc193d0c4b16d94a544955e462ae64ec955858ca2fd2a2a2602ff2fbe9158278 \
    --cmd "
      cp -r \$DARP2_TEMP_FOLDER/* \$DARP2_WORK_FOLDER;\
      cd \$DARP2_WORK_FOLDER;\
      rsize \"images/*.JPG\" \".\" --porcent 20"

Let's look at what is happening here:


  1. We install img-resize-cli that allows resizing of images.

  2. We pass a DAT link to darp2 with --input ...

  3. We pass --cmd to darp2 that is to be executed when the data is downloaded.


  4. darp2 passes two environment variables to the command:



    • DARP2_TEMP_FOLDER: Folder where all content of the DAT is downloaded to.


    • DARP2_WORK_FOLDER: Folder that the command should write all of its output to.




  5. cp -r $DARP2_TEMP_FOLDER/* $DARP2_WORK_FOLDER ... copies everything from the download folder to the work folder.

  6. By default darp2 executes the command in the download folder; with cd \$DARP2_WORK_FOLDER we switch to the work folder.


  7. img-resize-cli added the rsize command, which we use to resize all images to 20% of their original size.

This example should give a decent impression of how smooth remote data processing can be. One problem remains, though: we need to download all of the data to the processor before we can process it. In this example that is "just" 62MB, but suppose it grows to a gigabyte or a terabyte? Then the data will probably not arrive at the processor as quickly.

When you run this, you should notice that you get the link to the output DAT before the download or the processing has started! This is possible because data can be added to a DAT after its link has been created.
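The command passed to --cmd does not have to be a shell one-liner. As a minimal sketch, the copy step from the example could also live in a small Node.js script that reads the two environment variables. The file name process.js and the helper copyTree are made up for this illustration and are not part of darp2.

```javascript
// process.js: a hypothetical script to be started through darp2 --cmd.
const fs = require('fs')
const path = require('path')

// Recursively copy every file below `from` into `to`, keeping relative paths.
function copyTree (from, to) {
  fs.mkdirSync(to, { recursive: true })
  for (const entry of fs.readdirSync(from, { withFileTypes: true })) {
    const source = path.join(from, entry.name)
    const target = path.join(to, entry.name)
    if (entry.isDirectory()) {
      copyTree(source, target)
    } else {
      fs.copyFileSync(source, target)
    }
  }
}

// darp2 passes these two variables to the command (see the list above).
const tempFolder = process.env.DARP2_TEMP_FOLDER
const workFolder = process.env.DARP2_WORK_FOLDER

// Only copy when the script was really started by darp2.
if (tempFolder && workFolder) {
  copyTree(tempFolder, workFolder)
}
```

Because the command is executed in the download folder, such a script would most likely have to be referenced by an absolute path in --cmd.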


Processing with Node.js

The reason it has to download all of the data is that darp2 doesn't know which files are needed by the --cmd that is passed to it. If it knew that, we could download only the necessary parts!

Keen eyes may notice that there is the darp2 --glob option to select which files to download. This may help in some cases but we want to go about it in a more fine-grained fashion.


How the darp2 --cmd works

When you pass --cmd to darp2 it creates a lightweight wrapper around that command and passes the result to the --run-script option!

The run-script is expected to export an object with up to three optional methods:

interface Entry {
  location: string
  stats: Stat
}

module.exports = {
  processFolder?: (tempFolder: string, workFolder: string, entry: Entry, inArchive: Hyperdrive) => Promise<any> | any,
  processFile?: (tempFolder: string, workFolder: string, entry: Entry, inArchive: Hyperdrive) => PromiseLike<any> | any,
  finish?: (inArchive: Hyperdrive, outArchive: Hyperdrive) => Promise<any> | any
}



  • processFolder is called for every folder in the DAT.


  • processFile is called for every file in the DAT.


  • finish is called after every file & folder is finished.
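To see the three hooks working together, here is a minimal sketch of a run-script that does no real processing and only counts what it is handed. It assumes that entry.stats carries a numeric size the way fs.Stats does; that is an assumption of this sketch, not something the interface above guarantees.

```javascript
// Counters shared by the three hooks of this illustrative run-script.
const seen = { folders: 0, files: 0, bytes: 0 }

const script = {
  // Called once for every folder in the DAT.
  processFolder (tempFolder, workFolder, entry, inArchive) {
    seen.folders += 1
  },
  // Called once for every file in the DAT.
  processFile (tempFolder, workFolder, entry, inArchive) {
    seen.files += 1
    // Assumption: entry.stats.size holds the file size in bytes.
    seen.bytes += (entry.stats && entry.stats.size) || 0
  },
  // Called after all files and folders have been processed.
  finish (inArchive, outArchive) {
    console.log(`${seen.folders} folders, ${seen.files} files, ${seen.bytes} bytes`)
    return seen
  }
}

module.exports = script
```

Since every hook is optional, a script can leave out whatever it does not need.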

You might notice that archives are not DAT instances but Hyperdrive instances.

hyperdrive is a subsystem of DAT that abstracts file-system access without exposing networking or other parts.

You can access a hyperdrive almost like fs, but when you request data it will not be loaded from your hard drive but from connected peers!

Now, when darp2 is called with --cmd, the wrapper's processFile simply downloads every file into the tempFolder.

Once all files & folders are finished, finish will add the files of the work folder to the outArchive.
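The behaviour described above could be sketched roughly like this. It is not darp2's actual implementation: createCmdScript is a hypothetical name, the folders are remembered from processFile because finish does not receive them, and for brevity only the top-level files of the work folder are added to the output. The only archive methods used are hyperdrive's fs-like readFile(name, callback) and writeFile(name, data, callback).

```javascript
const fs = require('fs')
const path = require('path')
const { execSync } = require('child_process')
const { promisify } = require('util')

// Hypothetical factory: wraps a shell command into a run-script object.
function createCmdScript (cmd) {
  let folders = null // remembered here because finish() does not receive them
  return {
    async processFile (tempFolder, workFolder, entry, inArchive) {
      folders = { tempFolder, workFolder }
      // Download the file from the archive into the temp folder.
      const readFile = promisify(inArchive.readFile.bind(inArchive))
      const data = await readFile(entry.location)
      const target = path.join(tempFolder, entry.location)
      fs.mkdirSync(path.dirname(target), { recursive: true })
      fs.writeFileSync(target, data)
    },
    async finish (inArchive, outArchive) {
      if (!folders) return // nothing was downloaded
      const { tempFolder, workFolder } = folders
      fs.mkdirSync(workFolder, { recursive: true })
      // Run the command in the download folder with both variables set.
      execSync(cmd, {
        cwd: tempFolder,
        env: { ...process.env, DARP2_TEMP_FOLDER: tempFolder, DARP2_WORK_FOLDER: workFolder },
        stdio: 'inherit'
      })
      // Add the (top-level) files of the work folder to the output archive.
      const writeFile = promisify(outArchive.writeFile.bind(outArchive))
      for (const name of fs.readdirSync(workFolder)) {
        const file = path.join(workFolder, name)
        if (fs.statSync(file).isFile()) {
          await writeFile(name, fs.readFileSync(file))
        }
      }
    }
  }
}

module.exports = createCmdScript
```

The sketch makes the cost visible: every file passes through processFile and lands on disk, whether the command needs it or not.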


Writing our own script

[repo]

const { minify } = require('html-minifier')
const { copyToFs, readFile, writeFileToFs } = require('darp2')

module.exports = {
  async processFile (tempFolder, workFolder, entry, inArchive, outArchive, log) {
    if (/\.html$/i.test(entry.location)) {
      const html = await readFile(inArchive, entry.location, inArchive)
      const opts = {
        collapseWhitespace: true,
        removeAttributeQuotes: true,
        removeOptionalTags: true
      }
      log({ minHtml: entry.location, opts })
      const minHtml = minify(html.toString(), opts)

      // Let's store the HTML files at the same location they had in the input.
      return writeFileToFs(workFolder, entry.location, minHtml)
    }
    if (/\.(css|js|woff|otf|jpe?g|gif|png)$/i.test(entry.location)) {
      log({ copy: entry.location })
      return copyToFs(inArchive, workFolder, entry.location) // Just copy for now
    }
    log({ skip: entry.location })
    // Skip other files
  }
}

Let's quickly go through the script:


  1. It only has a processFile operation; processFolder and finish are not necessary for our work.

  2. It checks the entry location to see whether we have an HTML file and, if so, compresses it using html-minifier.

  3. It logs the options; this way we can see in the output's .dat-through.log which options were used at that time. If we wanted to be thorough, we could also include the package and Node.js versions.

  4. For non-HTML files it checks whether they are css, js, ... files and copies those to the target archive.

  5. It skips all other files. Those will not be downloaded!
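The routing decision in these steps can be pulled out into a small pure function, which makes it easy to check what happens to a given path. The two regular expressions are taken from the script above; the function name classify and its return values are made up for this sketch.

```javascript
// Same patterns as in the script above.
const HTML = /\.html$/i
const ASSET = /\.(css|js|woff|otf|jpe?g|gif|png)$/i

// Decide what the script would do with an entry location.
function classify (location) {
  if (HTML.test(location)) return 'minify'
  if (ASSET.test(location)) return 'copy'
  return 'skip'
}

for (const location of ['/index.html', '/images/IMG_01.JPG', '/data.json', '/README.md']) {
  console.log(location, classify(location))
}
```

Both patterns are anchored at the end of the path and case-insensitive, so IMG_01.JPG is copied while data.json and README.md are skipped and never downloaded.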


Executing the script

This is a simple example of what your own darp² script could look like. You can pass it to the --run-script option and it will be executed.

[repo]

$ darp2 \
    --run-script index.js \
    --input dc193d0c4b16d94a544955e462ae64ec955858ca2fd2a2a2602ff2fbe9158278

This will run the script against every file in the input DAT. The output link can be used as soon as it appears in the command line.

The .dat-through.json file contains a comprehensive log of what happened (including the log statements in the script).

You can immediately view the result with Beaker Browser. Try having a look at the "View Source" 😉


Final thoughts

darp² is opening the door to computing any data remotely and immediately working with the result.

While the implementation of darp² is still crude, it has a lot of potential. For example, it could come with an HTTP server that triggers the remote processing in a Docker container; then you could start remote processors at any cloud hoster in 30 seconds.

The input DAT link and the output DAT link are equally usable, so it is possible to pipe any number of operations through this remote system. With a bit of tuning of the command-line logic, the piping could actually be a lot quicker as well.

I hope this brought you a bit closer to darp² and the power of distributed remote processing.