Windows
AWS
S3
PowerShell

Verifying that data copied to S3 really matches the copy source

This article is an automatic translation of the article[0710a3de2628ef730500] below.

https://qiita.com/speaktech/items/0710a3de2628ef730500


1. Introduction

At the end of last year I found old personal data piling up on my home PC, so I wrote a **PowerShell** script that archives files whose last-modified date is older than a given number of days to S3, AWS's cloud storage service.

If all you need is a copy, the **Write-S3Object** command in "AWS Tools for Windows PowerShell", which AWS provides, makes it easy.

However, deleting the local data after copying it to S3 takes some courage. I was slightly worried: "Is the destination really identical to the source?"

So, **instead of merely copying to S3 and deleting the local data, I added a step that verifies the copy source and destination are identical at copy time**.

While working on the verification I learned several things about how S3 behaves, so I will share them here.


2. Required preparation


2.1. Installing AWS Tools for Windows PowerShell

Download AWS Tools from the following page and install it.

https://aws.amazon.com/jp/powershell/

Detailed setup procedures and prerequisites are listed on the following page.

https://docs.aws.amazon.com/en-US/powershell/latest/userguide/pstools-getting-set-up-windows.html#prerequisites


2.2. Creating profiles

After installing AWS Tools as described above, register the access key and secret key of your AWS account with the **Set-AWSCredential** command to create a profile.

https://docs.aws.amazon.com/en-US/powershell/latest/userguide/specifying-your-aws-credentials.html

# Example 1) Registering a profile

PS C:\> Set-AWSCredential -AccessKey ACCESS_KEY -SecretKey SECRET_KEY -StoreAs PROFILE_NAME

This completes the necessary preparations.


3. Specification of S3


3.1. About the multipart upload function

To upload to S3 with PowerShell, use the **Write-S3Object** command.

When **Write-S3Object** uploads data of 16 MB or more, a **multipart upload** is always performed.

A multipart upload divides the local data into 5 MB parts and uploads them to S3 in parallel.

By default, 10 parts are uploaded in parallel. The number of simultaneous uploads can be controlled with the **-ConcurrentServiceRequest** parameter, so simple bandwidth control is possible.

(Checking with the netstat command confirmed that 10 sessions to S3 were indeed established.)
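The part count follows directly from the threshold and part size described above. The following is a minimal sketch of that arithmetic; the function name `Get-S3PartCount` is hypothetical, and the 16 MB threshold and 5 MB part size are taken from this article rather than queried from the tool.

```powershell
# Sketch: estimate how many parts an upload would be split into.
# Assumes the 16 MB threshold and 5 MB part size described in this article.
function Get-S3PartCount {   # hypothetical helper name
    param(
        [Parameter(Mandatory=$true)][long]$Length,   # file size in bytes
        [long]$PartSize  = 5MB,
        [long]$Threshold = 16MB
    )
    # Below the threshold the file is uploaded in a single request
    if ($Length -lt $Threshold) { return 1 }
    # Otherwise round up: the last part may be smaller than $PartSize
    return [long][math]::Ceiling([double]$Length / $PartSize)
}

Get-S3PartCount -Length 15MB   # → 1   (below the threshold)
Get-S3PartCount -Length 16MB   # → 4   (5 + 5 + 5 + 1 MB)
Get-S3PartCount -Length 1GB    # → 205 (1024 / 5 = 204.8, rounded up)
```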


3.2. About the eTag property of the S3 object

While investigating how to verify the integrity of data uploaded to S3 (that is, whether the copy is identical to its source), I noticed that **the eTag property of an S3 object matches the MD5 digest of the local data**.

(The official documentation also states that an MD5 digest is used.)

I thought that comparing the two would confirm that the local data had been copied to S3 intact, but **this did not work when a multipart upload had been performed**.

That is because S3 calculates the eTag with different logic depending on whether a multipart upload was used, as follows.


i. eTag without multipart upload (data smaller than 16 MB)

The eTag matches the MD5 digest of the local data.

Example) 10 MB of data

eTag property of S3: cd573cfaace07e7949bc0c46028904ff

Local data MD5: cd573cfaace07e7949bc0c46028904ff
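For the non-multipart case, the local digest can be obtained with the built-in `Get-FileHash` cmdlet (PowerShell 4.0 or later). This is only a sketch: the sample file and its path are hypothetical, and the hash is lowercased because the eTag values shown in this article are lowercase.

```powershell
# Sketch: compute the MD5 digest of a local file in the same form as a non-multipart eTag.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'etag-sample.txt'   # hypothetical sample file
[System.IO.File]::WriteAllText($path, 'hello')                          # 5 bytes, UTF-8 without BOM

# Get-FileHash returns uppercase hex, so lowercase it before comparing with an eTag
$localMd5 = (Get-FileHash -Path $path -Algorithm MD5).Hash.ToLower()
$localMd5   # → 5d41402abc4b2a76b9719d911017c592

Remove-Item $path
```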


ii. eTag with multipart upload (data of 16 MB or more)

① Read the data in binary form

② Split the data into parts and compute the MD5 digest of each part

Note: the part size here is 5 MB

③ Concatenate the MD5 digests of all parts

④ Compute the MD5 digest of the concatenated digests

⑤ Append a hyphen and the total number of parts to the result of ④

Example) 1 GB of data

eTag property of S3: cb45770d6cf51effdfb2ea35322459c3-205
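Steps ① to ⑤ can be traced on a small in-memory byte array. The sketch below is an illustration only: the function name `Get-MultipartETag` is hypothetical, and a 4-byte part size is used purely to keep the example small, whereas the uploads in this article use 5 MB parts.

```powershell
# Sketch of steps 1-5 on an in-memory byte array (not the script used in this article).
function Get-MultipartETag {   # hypothetical helper name
    param(
        [Parameter(Mandatory=$true)][byte[]]$Data,
        [Parameter(Mandatory=$true)][int]$PartSize
    )
    $md5     = [System.Security.Cryptography.MD5]::Create()
    $digests = New-Object System.IO.MemoryStream
    $parts   = 0
    for ($offset = 0; $offset -lt $Data.Length; $offset += $PartSize) {
        $count      = [math]::Min($PartSize, $Data.Length - $offset)
        $partDigest = $md5.ComputeHash($Data, $offset, $count)    # step 2: MD5 of each part
        $digests.Write($partDigest, 0, $partDigest.Length)        # step 3: concatenate the digests
        $parts++
    }
    $final = $md5.ComputeHash($digests.ToArray())                 # step 4: MD5 of the concatenation
    $hex   = [System.BitConverter]::ToString($final).Replace('-', '').ToLower()
    "$hex-$parts"                                                 # step 5: append "-<number of parts>"
}

# 10 bytes with a 4-byte part size gives parts of 4, 4 and 2 bytes
$data = [System.Text.Encoding]::ASCII.GetBytes('0123456789')
Get-MultipartETag -Data $data -PartSize 4
```

The result is 32 hexadecimal characters followed by `-3`, which has the same shape as the `-205` suffix in the 1 GB example (1024 MB / 5 MB rounds up to 205 parts).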


4. Integrity verification


4.1. Validation policy

**Depending on the file size (with 16 MB as the threshold), I calculate the expected S3 eTag value from the local data and compare it with the actual eTag property**.

For the eTag calculation used with multipart uploads (data of 16 MB or more), I referred to existing code and made some corrections to the calculation logic.


4.2. Result of verification

I verified this with 15 MB, 16 MB, and 17 MB dummy files and a large file created with the fsutil command, and in every case **I confirmed that the S3 eTag value calculated from the local data matches the actual eTag property**.

This establishes a procedure for confirming that the copy source and destination are identical, so once integrity has been confirmed this way, the local data can be deleted safely.
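Dummy files around the 16 MB threshold can be created roughly as follows. This is a sketch under assumptions: the working folder is hypothetical, `fsutil file createnew` is Windows-only and may require an elevated prompt on some systems, and the files it creates are zero-filled.

```powershell
# Sketch: create zero-filled dummy files of 15, 16 and 17 MB (Windows only).
$dir = Join-Path $env:TEMP 's3-etag-test'    # hypothetical working folder
New-Item -ItemType Directory -Path $dir -Force | Out-Null

foreach ($mb in 15, 16, 17) {
    $file = Join-Path $dir "${mb}MB.txt"
    if (Test-Path $file) { Remove-Item $file }    # createnew fails if the file already exists
    $size = $mb * 1MB                             # size in bytes
    fsutil file createnew $file $size | Out-Null
}

# List the created files with their sizes in bytes
Get-ChildItem $dir | Select-Object Name, Length
```

Given the 5 MB part size, the calculated eTag for the 15 MB file should carry no suffix, while the 16 MB and 17 MB files should both end in `-4`.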


5. Postscript

I packaged the PowerShell commands used for the verification into a script as two functions: **Get-S3ETagHash**, which calculates the S3 eTag value from local data, and **Archive-FileToS3**, which performs the whole sequence of uploading, confirming integrity, and deleting the local data.

(The former is based on https://gist.github.com/seanbamforth/9388507)

After importing the script **Archive-FileToS3.ps1**, you can use the functions as shown below. The script itself is posted at the end for reference.

# Example 2) Importing the script and using the functions

PS D:\Tmp> Import-Module .\Archive-FileToS3.ps1

PS D:\Tmp> Get-S3ETagHash -Path D:\Tmp\16MB.txt

Algorithm Hash                               Path
--------- ----                               ----
S3ETag    eafa449afe224ad0b7f8f5bab4145d13-4 D:\Tmp\16MB.txt

PS D:\Tmp> Archive-FileToS3 -Path FOLDER_TO_ARCHIVE -Bucketname BUCKET_NAME -Days DAYS


[Reference] PowerShell script (Archive-FileToS3.ps1)


Archive-FileToS3.ps1

# Archive-FileToS3 archives local files under a path that are older than the specified number of days to an S3 bucket, verifying the integrity of the uploaded S3 objects.

# The verification method is to check whether the eTag of the S3 object matches the eTag value calculated from the local file.

# Get-S3ETagHash calculates an eTag for a local file that should match the S3 eTag of the uploaded file.
# Credit goes to Sean Bamforth (https://gist.github.com/seanbamforth/9388507) and chrisdarth (https://gist.github.com/chrisdarth/02d030b31727d70d2c63)

function Archive-FileToS3 {
    [CmdletBinding()]
    Param (
        [Parameter(Mandatory=$true)]
        [ValidateScript({ Test-Path $_ -PathType Container })]
        [string]$Path,
        [Parameter(Mandatory=$true)]
        [ValidateScript({ $( Get-S3Bucket -BucketName $_ ) })]
        [string]$Bucketname,
        [Parameter(Mandatory=$true)]
        [Int32]$Days,
        [Int32]$ConcurrentServiceRequest = 10
    )
    # Resolve to a full path so that the key offset below also works for relative paths
    $Path = (Resolve-Path $Path).ProviderPath
    # Strip a trailing backslash so the offset is the same with or without it
    if ($Path[$Path.Length-1] -eq "\" ){ $Path = $Path.Substring(0,$Path.Length-1) }
    # Files only: PSIsContainer is true for directories regardless of their other attributes
    foreach ($file in $(Get-ChildItem $Path -Recurse | Where-Object { -not $_.PSIsContainer })){
        if ($file.LastWriteTime -lt ((Get-Date).AddDays(-$Days))){
            # The object key is the file path relative to $Path
            $key = $file.FullName.Substring($Path.Length + 1)
            Write-S3Object -BucketName $Bucketname -Key $key -File $file.FullName -ConcurrentServiceRequest $ConcurrentServiceRequest
            $s3object = Get-S3Object -BucketName $Bucketname -Key $key
            # The eTag is returned wrapped in double quotes, so remove them
            $etag = $s3object.etag.Replace("`"","")
            $hash = (Get-S3ETagHash -Path $file.FullName).Hash
            if ($etag -eq $hash){
                # Verified: the local copy can be deleted
                Remove-Item $file.FullName -Force
            }else{
                # Mismatch: keep the local file and discard the uploaded object
                Remove-S3Object -BucketName $Bucketname -Key $key -Force
            }
        }
    }
}

function Get-S3ETagHash {
    [CmdletBinding()]
    Param (
        [Parameter(Mandatory=$true)]
        [ValidateScript({ Test-Path $_ -PathType Leaf })]
        [string]$Path,
        [Int32]$ChunkSize = 5
    )
    $filename = Get-Item $Path
    $md5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider

    $blocksize = (1024*1024*$ChunkSize)    # part size in bytes (5 MB by default)
    $startblocks = (1024*1024*16)          # files of 16 MB or more are treated as multipart uploads

    $lines = 0                             # number of parts
    [byte[]] $binHash = @()

    # Open by full path so the file is found regardless of the .NET current directory
    $reader = [System.IO.File]::Open($filename.FullName,"Open","Read")

    try {
        if ($filename.Length -ge $startblocks) {
            # Multipart: concatenate the MD5 digest of each part, then hash the concatenation
            $buf = New-Object byte[] $blocksize
            while (($read_len = $reader.Read($buf,0,$buf.Length)) -ne 0){
                $lines += 1
                $binHash += $md5.ComputeHash($buf,0,$read_len)
            }
            $binHash = $md5.ComputeHash( $binHash )
        }
        else {
            # Single part: plain MD5 digest of the whole file
            $lines = 1
            $binHash += $md5.ComputeHash($reader)
        }
    }
    finally {
        # Release the file handle even if hashing fails
        $reader.Close()
    }

    $hash = [System.BitConverter]::ToString( $binHash )
    $hash = $hash.Replace("-","").ToLower()

    # Multipart eTags carry the number of parts as a suffix
    if ($lines -gt 1) {
        $hash = $hash + "-$lines"
    }

    # Output pscustomobject, equal to Get-FileHash
    [pscustomobject]@{
        Algorithm = "S3ETag"
        Hash = $hash
        Path = $filename.FullName
    }
}