[Java] Getting around CAPTCHA verification with JSoup

Last updated at 2022-06-07, posted at 2021-04-06

Et tu, JSoup

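The other day I wrote an article titled "[Java] What to do when Selenium + ChromeDriver suddenly asks 'Are you human?'".
That got solved, and I was thinking, great, now my web scraping will fly!! But...
This time a different tool, one that uses JSoup rather than Selenium, walked into the CAPTCHA trap!!
It has years of production use behind it! And then, suddenly! It stopped working!!!
What actually happens is that accessing the site with JSoup returns HTTP status 403 and the tool terminates abnormally.
That part can be solved by setting the user agent and various other things on the connection, but... then you get bounced to CAPTCHA verification...
And even if you get past that, depending on the target URL a JavaScript syntax error occurs...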
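
For reference, the "set the user agent and so on at connection time" step might look roughly like the sketch below. The class name, the user-agent string, the Accept-Language value, and the URL are all placeholders I made up for illustration, not the settings used in the original tool. As described above, this only addresses the 403; it does nothing about the CAPTCHA redirect.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

final class JsoupHeaderSketch {

    // Hypothetical user-agent string; substitute the one your own browser sends.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36";

    // Builds a connection carrying browser-like headers. No request is sent yet.
    static Connection browserLikeConnection(String url) {
        return Jsoup.connect(url)
                .userAgent(USER_AGENT)                    // sets the User-Agent header
                .header("Accept-Language", "ja,en;q=0.8") // assumed value, adjust as needed
                .timeout(10_000);                         // milliseconds
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; get() performs the actual HTTP request.
        Document doc = browserLikeConnection("https://example.com/").get();
        System.out.println(doc.title());
    }
}
```

Keeping the connection setup in its own method means the headers can be inspected without touching the network, which is handy when you are guessing which header the site is picky about.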

In the end JSoup alone didn't cut it, so I brought in HtmlUnit...
Add the dependency to Maven, Gradle, or whatever you use, as needed.
Maven Repository: net.sourceforge.htmlunit » htmlunit

Source

I didn't want to change the processing that runs after the fetch, so I stuff the fetched HTML into JSoup's Document class.

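// HtmlUnit imports
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
// JSoup imports
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Implementation
@SuppressWarnings("resource") // the client is not closed here; call webClient.close() when finished
final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED);
// Keep going when the page's JavaScript throws instead of aborting the fetch
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.setJavaScriptTimeout(10000); // milliseconds
webClient.getOptions().setJavaScriptEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.getOptions().setTimeout(10000); // milliseconds

Page page = webClient.getPage("target URL"); // placeholder: put the URL to fetch here
// Hand the response body to JSoup so the downstream code keeps working on a Document
Document doc = Jsoup.parse(page.getWebResponse().getContentAsString());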
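
One thing to watch if the downstream code resolves links: `Jsoup.parse(String)` has no base URI, so `absUrl(...)` returns an empty string for relative links. Below is a minimal sketch of passing the base URI explicitly; the class name, HTML snippet, and URL are made up for illustration. Whether the original tool needs this depends on what its post-processing does, which the article doesn't say.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class BaseUriSketch {

    // Parses HTML obtained elsewhere (for example from HtmlUnit) and tells JSoup
    // which URL it came from, so relative links can be resolved later.
    static Document parseWithBase(String html, String baseUri) {
        return Jsoup.parse(html, baseUri);
    }

    public static void main(String[] args) {
        String html = "<a href=\"/next\">next</a>"; // stand-in for the fetched page

        Document withBase = parseWithBase(html, "https://example.com/list");
        // prints "https://example.com/next"
        System.out.println(withBase.select("a").first().absUrl("href"));

        Document withoutBase = Jsoup.parse(html);
        // prints an empty line: there is no base URI to resolve against
        System.out.println(withoutBase.select("a").first().absUrl("href"));
    }
}
```

With the HtmlUnit code above, the base URI could presumably come from `page.getUrl().toString()`.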

Dying while dealing with the death of a tool that just shows up one day... I mean, it happens a lot, but still... yeah...