Removing HTML tags (sed) English (65)

Last updated at 2024-05-12. Posted at 2020-11-01.

I am trying out ways to generate a txt file from an html file.

<This article is a work in progress. I will add to it as I go.>

Delete html tags using sed/grep/awk
https://unix.stackexchange.com/questions/359980/delete-html-tags-using-sed-grep-awk
rmtag.sed

sed -e 's/<[^>]*>//g' index.* > indexall.txt

The command above did not handle tags that span two lines; those tags were left in the output.

For example:

indexall.txt

	<section id="c188278" class="g-padding-top-5 g-padding-bottom-5
	    ">

Removing the newlines before running sed might work.
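
The one-liner misses such tags because sed applies the substitution to one line at a time, so `<[^>]*>` never sees a `<` and its closing `>` together. As an alternative to deleting newlines beforehand, the sketch below first reproduces the miss and then lets sed collect the whole input in its pattern space before substituting. The function names are hypothetical, and the sketch assumes GNU sed (or another sed in which `[^>]` also matches a newline held in the pattern space).

```shell
#!/bin/sh
# Line-by-line version: the same expression as rmtag.sed.
# A tag split across two lines is never matched.
strip_tags_per_line() {
    sed -e 's/<[^>]*>//g' "$@"
}

# Whole-input version (hypothetical helper).
#   :a            define label a
#   $!{ N; ba }   unless this is the last line, append the next line and loop
#   s/<[^>]*>//g  runs once, after all lines are in the pattern space
strip_tags_slurp() {
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/<[^>]*>//g' "$@"
}
```

Usage would look like `strip_tags_slurp index.html > index.txt`. Newlines outside tags are kept, and the whole file is held in memory, which should be fine for ordinary web pages.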

Removing newline codes from a variable in bash
https://hacknote.jp/archives/12122/

Getting stuck on deleting newline characters in a shell script
https://hacolab.hatenablog.com/entry/2019/07/29/235312

rmreturn.tr

tr -d '\n' < infile > outfile

This looks like it will work.
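
The two steps can also be joined in a pipeline, so that no intermediate file is needed. A minimal sketch, where `strip_tags` is a hypothetical function name and the behaviour on input without a trailing newline is assumed to be that of GNU sed:

```shell
#!/bin/sh
# strip_tags FILE
# Delete every newline, then delete everything between < and >.
strip_tags() {
    tr -d '\n' < "$1" | sed -e 's/<[^>]*>//g'
}
```

Because `tr -d '\n'` also glues the last word of one line to the first word of the next, `tr '\n' ' '` (replace each newline with a space) may be preferable when the body text itself is spread over several lines.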

bat.sh

#!/bin/sh
# https://unix.stackexchange.com/questions/359980/delete-html-tags-using-sed-grep-awk
# https://hacolab.hatenablog.com/entry/2019/07/29/235312
# https://qiita.com/kaizen_nagoya/items/ad4f75b391ff95e72634

_LISPDIRS=`find . -maxdepth 1 -mindepth 1 -type d`

for _DIRS in ${_LISPDIRS}; do
    echo -e "\n-- ${_DIRS} --"
    cd $_DIRS
    tr -d '\n' < index.* >> /tmp/web/indexall.txt
    cd ../
done
sed -e 's/<[^>]*>//g' /tmp/web/indexall.txt > index.txt

This still did not work; running it produced the error in the log below.

[Shell script] Various ways to check whether a file or directory exists
https://www.server-memo.net/shellscript/file_check.html

docker/bash

# ./bat.sh
-e 
-- ./www.xxx.com --

./bat.sh: 11: ./bat.sh: cannot open index.*: No such file
# ls 
  index.html
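
One plausible reading of this log, assuming `/bin/sh` in the container is dash (the literal `-e` printed by `echo` points that way): under POSIX, a non-interactive shell does not perform pathname expansion on the word after a redirection operator, so `< index.*` tries to open a file literally named `index.*`. The sketch below expands the pattern as an ordinary word in a `for` loop instead. `join_index` is a hypothetical helper name.

```shell
#!/bin/sh
# join_index DIR
# Write the contents of every DIR/index.* file to stdout with newlines removed.
join_index() {
    for f in "$1"/index.*; do
        # When nothing matches, the pattern is left as-is; skip that case.
        [ -e "$f" ] || continue
        # A variable as the redirection target is expanded normally.
        tr -d '\n' < "$f"
    done
}
```

Inside the loop of bat.sh this could be called as `join_index "$_DIRS" >> /tmp/web/indexall.txt`, which also removes the need for `cd`.
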

bat.sh

#!/bin/sh
# https://unix.stackexchange.com/questions/359980/delete-html-tags-using-sed-grep-awk
# https://hacolab.hatenablog.com/entry/2019/07/29/235312
# https://qiita.com/kaizen_nagoya/items/ad4f75b391ff95e72634

_LISPDIRS=`find . -maxdepth 1 -mindepth 1 -type d`

for _DIRS in ${_LISPDIRS}; do
    echo -e "\n-- ${_DIRS} --"
    cd $_DIRS
    if [-e index.* ] then
      tr -d '\n' < index.html >> /tmp/web/indexall.txt
    fi
    cd ../
done
sed -e 's/<[^>]*>//g' /tmp/web/indexall.txt > index.txt

docker/bash

# ./bat.sh
./bat.sh: 13: ./bat.sh: Syntax error: "fi" unexpected (expecting "then")
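
The error is reported at `fi`, but the cause is on the `if` line: `[` is a command name, so it needs a space before `-e`, and `then` must be separated from the test by `;` or a newline. A corrected sketch of the whole script follows. Wrapping it in a function with three arguments (`strip_sites`, a hypothetical name) is an assumption made for illustration, as are `mkdir -p`, emptying the work file on each run, and `printf` in place of `echo -e`.

```shell
#!/bin/sh
# strip_sites ROOT WORK OUT
#   ROOT  directory whose subdirectories each hold an index.html
#   WORK  intermediate file (newlines removed, tags still present)
#   OUT   final text file with the tags removed
strip_sites() {
    root=$1
    work=$2
    out=$3
    mkdir -p "$(dirname "$work")"
    : > "$work"                          # start empty so reruns do not append twice
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue        # pattern matched nothing
        printf '\n-- %s --\n' "$dir" >&2 # progress message, portable unlike echo -e
        if [ -e "${dir}index.html" ]; then   # space after [ and ; before then
            tr -d '\n' < "${dir}index.html" >> "$work"
        fi
    done
    sed -e 's/<[^>]*>//g' "$work" > "$out"
}
```

The equivalent of the original invocation would be `strip_sites . /tmp/web/indexall.txt index.txt`.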

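<This article is a personal impression based on my own past experience. It has nothing to do with the organization I currently belong to or with my work.>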

Thank you very much for reading to the end.

Please press the like icon 💚 and follow me for your happy life.
