
[perl] A one-liner to extract sample names from MiSeq output file names

Posted at 2021-10-09

Environment

Microsoft Windows [Version 10.0.19042.1237]
Windows Subsystem for Linux 2
Ubuntu 20.04.3 LTS
perl5 (revision 5 version 26 subversion 2)

Command
ls -1v | perl -lne 'if(/_R1/){print "$`"}'
Input data
(base) usr@LIN:~/test$ ls -1v 
SampleName1_S1_L001_R1_001.fastq.gz
SampleName1_S1_L001_R2_001.fastq.gz
SampleName2_S2_L001_R1_001.fastq.gz
SampleName2_S2_L001_R2_001.fastq.gz
Output
(base) usr@LIN:~/test$ ls -1v | perl -lne 'if(/_R1/){print "$`"}'
SampleName1_S1_L001
SampleName2_S2_L001

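Explanation

Perl special variables [1]

  1. The entire current input line: $_ (can be omitted)
  2. The matched string: $&
  3. The string before the match: $`
  4. The string after the match: $'

grep can print the lines before or after a matching line, but extracting the text before or after a match within a line is harder.
This example uses variable 3.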

Perl conditionals and pattern matching

'if(/_R1/){print "$`"}'

This means: "if the input line contains _R1, print the string that precedes _R1".

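Other notes

  • bash
    ls -1: list the directory contents one entry per line
    ls -v: sort the listing in natural numeric order (100 comes after 2)

  • perl
    -n: read the input one line at a time and process it through the last line
    -l: strip the trailing newline from each input line and append a newline on output
    -e: treat the next argument as the program code (so it must come last when options are bundled, as in -lne)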

A practical example using a regular expression

Command
ls -1v | perl -lne 'if(/_.*_.*_R1/){print "$`"}'
Output
(base) usr@LIN:~/test$ ls -1v | perl -lne 'if(/_.*_.*_R1/){print "$`"}'
SampleName1
SampleName2

Perl regular expressions [2]

.: any single character except a newline
*: zero or more repetitions of the preceding character

Remarks

MiSeq output file name format (quoted from the Illumina website, 2021/10/09)

Naming Convention

FASTQ files are named with the sample name and the sample number, which is a numeric assignment based on the order that the sample is listed in the sample sheet. For example: Data\Intensities\BaseCalls\SampleName_S1_L001_R1_001.fastq.gz

SampleName —The sample name provided in the sample sheet. If a sample name is not provided, the file name includes the sample ID, which is a required field in the sample sheet and must be unique.
S1 —The sample number based on the order that samples are listed in the sample sheet starting with 1. In this example, S1 indicates that this sample is the first sample listed in the sample sheet.
  NOTE
  >Reads that cannot be assigned to any sample are written to a FASTQ file for sample number 0, and excluded from downstream analysis.
L001 —The lane number.
R1 —The read. In this example, R1 means Read 1. For a paired-end run, there is at least one file with R2 in the file name for Read 2. When generated, index reads are I1 or I2.
001 —The last segment is always 001.
FASTQ files that do not follow this naming convention cannot be imported into BaseSpace.

References

  1. Perl special variables (Perlの特殊変数)

  2. Let's master Perl regular expressions (Perlの正規表現をマスターしよう)
