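About max_quoted_size_limit in the CSV parser
This post looks at max_quoted_size_limit, an option introduced in the CSV parser in Embulk 0.5.0.
The option applies when a column in a row contains a quote character and a delimiter character (such as ,) appears inside the quotes.
It specifies how many bytes the parser reads ahead while checking whether the closing quote was forgotten.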
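To see why such a cap is useful, here is a minimal sketch using Python's csv module, not Embulk's tokenizer. The sample data is hypothetical: the quote opened on the first row is never closed. With no size limit at all, the quoted field simply absorbs the rows that follow it.

```python
import csv
import io

# Hypothetical sample: the quote opened on the first row is never closed.
BROKEN = '1,"abc,d\r\n2,x,y\r\n3,p,q\r\n'

def parse(text):
    """Parse CSV text with Python's csv module and return the rows as lists."""
    reader = csv.reader(io.StringIO(text, newline=''), delimiter=',', quotechar='"')
    return list(reader)

rows = parse(BROKEN)
# The three physical lines collapse into one logical row, and the second
# column contains the text of the following lines.
print(len(rows))
print(repr(rows[0][1]))
```

A limit on the size of a quoted value bounds how far this kind of runaway read can go before the parser gives up on the row.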
Example
Configuration file
max_quoted_size_limit is set to 6.
in:
  type: file
  path_prefix: /path/to/test
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    escape: ''
    header_line: false
    columns:
    - {name: c0, type: string}
    - {name: c1, type: string}
    - {name: c2, type: string}
    - {name: c3, type: string}
    - {name: c4, type: string}
    max_quoted_size_limit: 6
exec: {}
out: {type: stdout}
Test data
1,b,c,d,e
2,",123456",b,c,error line
3,",12345",b,c,safe line
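- Line 1 is normal data (no quotes).
- In line 2, the quoted value has 6 bytes of data after the , and the closing quote comes at the 7th byte.
  - This line is expected to produce an error.
  - It is written this way because it makes the check easy to verify.
- In line 3, only 5 bytes of data follow the , so the closing quote is the 6th character and the line is expected to pass.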
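The expectations above can be checked with a small sketch. It is not Embulk's implementation: it uses Python's csv module to read the same three lines and measures the byte size of the second column only, since that is where the quoted values are. The names DATA, LIMIT, and check_lines are made up for this illustration, and the comparison (size greater than the limit means error) is an assumption inferred from the test data.

```python
import csv
import io

LIMIT = 6  # same value as max_quoted_size_limit in the configuration above

# The three test lines, joined with CRLF as in the configuration.
DATA = (
    '1,b,c,d,e\r\n'
    '2,",123456",b,c,error line\r\n'
    '3,",12345",b,c,safe line\r\n'
)

def check_lines(text, limit):
    """Return (first column, byte size of second column, within limit) per row."""
    results = []
    reader = csv.reader(io.StringIO(text, newline=''), delimiter=',', quotechar='"')
    for row in reader:
        size = len(row[1].encode('utf-8'))
        # Assumption: a value is rejected only when it is larger than the limit.
        results.append((row[0], size, size <= limit))
    return results

for line_id, size, ok in check_lines(DATA, LIMIT):
    print(line_id, size, 'ok' if ok else 'exceeds limit')
```

Under that assumption, line 2 has a 7-byte quoted value (the , plus 6 digits) and goes over the limit, while line 3 has exactly 6 bytes and stays within it.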
Run
The error is detected on line 2, and that line is skipped.
% embulk preview config.yml
2015-03-05 09:47:07.679 +0900: Embulk v0.5.0
2015-03-05 09:47:09.060 +0900 [INFO] (preview): Listing local files at directory '/path/to' filtering filename by prefix 'test'
2015-03-05 09:47:09.069 +0900 [INFO] (preview): Loading files [/path/to/test.csv]
2015-03-05 09:47:09.203 +0900 [WARN] (preview): Skipped (line 2): 2,",123456",b,c,error line
org.embulk.standards.CsvTokenizer$QuotedSizeLimitExceededException: The size of the quoted value exceeds the limit size (6)
at org.embulk.standards.CsvTokenizer.nextColumn(CsvTokenizer.java:278)
at org.embulk.standards.CsvParserPlugin.nextColumn(CsvParserPlugin.java:216)
at org.embulk.standards.CsvParserPlugin.access$000(CsvParserPlugin.java:30)
at org.embulk.standards.CsvParserPlugin$1.stringColumn(CsvParserPlugin.java:175)
at org.embulk.spi.Column.visit(Column.java:57)
at org.embulk.spi.Schema.visitColumns(Schema.java:48)
at org.embulk.standards.CsvParserPlugin.run(CsvParserPlugin.java:132)
at org.embulk.spi.FileInputRunner.run(FileInputRunner.java:145)
at org.embulk.exec.PreviewExecutor$2$1.run(PreviewExecutor.java:106)
at org.embulk.spi.util.Filters$RecursiveControl.transaction(Filters.java:83)
at org.embulk.spi.util.Filters.transaction(Filters.java:36)
at org.embulk.exec.PreviewExecutor$2.run(PreviewExecutor.java:96)
at org.embulk.spi.FileInputRunner$RunnerControl$1$1.run(FileInputRunner.java:117)
at org.embulk.standards.CsvParserPlugin.transaction(CsvParserPlugin.java:89)
at org.embulk.spi.FileInputRunner$RunnerControl$1.run(FileInputRunner.java:111)
at org.embulk.spi.util.Decoders$RecursiveControl.transaction(Decoders.java:77)
at org.embulk.spi.util.Decoders.transaction(Decoders.java:33)
at org.embulk.spi.FileInputRunner$RunnerControl.run(FileInputRunner.java:108)
at org.embulk.standards.LocalFileInputPlugin.resume(LocalFileInputPlugin.java:80)
at org.embulk.standards.LocalFileInputPlugin.transaction(LocalFileInputPlugin.java:70)
at org.embulk.spi.FileInputRunner.transaction(FileInputRunner.java:63)
at org.embulk.exec.PreviewExecutor.doPreview(PreviewExecutor.java:93)
at org.embulk.exec.PreviewExecutor.access$000(PreviewExecutor.java:27)
at org.embulk.exec.PreviewExecutor$1.run(PreviewExecutor.java:67)
at org.embulk.exec.PreviewExecutor$1.run(PreviewExecutor.java:63)
at org.embulk.spi.Exec.doWith(Exec.java:21)
at org.embulk.exec.PreviewExecutor.preview(PreviewExecutor.java:63)
at org.embulk.command.Runner.preview(Runner.java:240)
at org.embulk.command.Runner.main(Runner.java:100)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.jruby.javasupport.JavaMethod.invokeDirectWithExceptionHandling(JavaMethod.java:470)
at org.jruby.javasupport.JavaMethod.invokeDirect(JavaMethod.java:328)
at org.jruby.java.invokers.InstanceMethodInvoker.call(InstanceMethodInvoker.java:71)
at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:346)
at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:204)
at org.jruby.ast.CallTwoArgNode.interpret(CallTwoArgNode.java:59)
at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
at org.jruby.ast.RescueNode.executeBody(RescueNode.java:221)
at org.jruby.ast.RescueNode.interpret(RescueNode.java:116)
at org.jruby.ast.BeginNode.interpret(BeginNode.java:83)
at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
at org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
at org.jruby.ast.CaseNode.interpret(CaseNode.java:138)
at org.jruby.ast.NewlineNode.interpret(NewlineNode.java:105)
at org.jruby.ast.BlockNode.interpret(BlockNode.java:71)
at org.jruby.evaluator.ASTInterpreter.INTERPRET_METHOD(ASTInterpreter.java:74)
at org.jruby.internal.runtime.methods.InterpretedMethod.call(InterpretedMethod.java:182)
at org.jruby.internal.runtime.methods.DefaultMethod.call(DefaultMethod.java:203)
at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:326)
at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:170)
at classpath_3a_embulk.command.embulk.__file__(classpath:embulk/command/embulk.rb:43)
at classpath_3a_embulk.command.embulk.load(classpath:embulk/command/embulk.rb)
at org.jruby.Ruby.runScript(Ruby.java:866)
at org.jruby.Ruby.runScript(Ruby.java:859)
at org.jruby.Ruby.runNormally(Ruby.java:728)
at org.jruby.Ruby.runFromMain(Ruby.java:577)
at org.jruby.Main.doRunFromMain(Main.java:395)
at org.jruby.Main.internalRun(Main.java:290)
at org.jruby.Main.run(Main.java:217)
at org.jruby.Main.main(Main.java:197)
at org.embulk.cli.Main.main(Main.java:13)
+-----------+-----------+-----------+-----------+-----------+
| c0:string | c1:string | c2:string | c3:string | c4:string |
+-----------+-----------+-----------+-----------+-----------+
|         1 |         b |         c |         d |         e |
|         3 |    ,12345 |         b |         c | safe line |
+-----------+-----------+-----------+-----------+-----------+