[TECHSUCKS] STRANGE SED BEHAVIOUR
2007 April 23 | 2 comments
I run an IRC bot, powered by ii, that automatically prints the content of the <title> tag of any URL that is posted on its own, without an explanation of what it links to.
It does it like this:
wget -o /dev/null -O - "http://www.example.com/" | tr '\n' ' ' | tr -d $'\r' > tmp
read title < <( sed 's,^.*<title>\(.*\)</title>.*,\1,g' tmp )
With pages from Spiegel Online this causes a problem I can't track down. I have put up an example page for which the sed call gives different results depending on the LANG environment variable:
export LANG=C
read title < <( sed 's,^.*<title>\(.*\)</title>.*,\1,g' spiegel.html )
# title now contains the string between <title> and </title>
export LANG=en_US.utf8
read title < <( sed 's,^.*<title>\(.*\)</title>.*,\1,g' spiegel.html )
# title now contains the string between <title> and </title> PLUS everything after an Ü character
sed --version
GNU sed version 4.1.5
Can someone explain this?
EOF
Category: blog
Tags: TechSucks
2 Comments
From: flipflip
2007-04-24 08:31:11 +0200
I cannot reproduce your problem. But maybe the following works:
title=`wget -qO- http://blog.crash-override.net/img/spiegel.html | sed 's,.*<title>\(.*\)</title>.*,\1,mi'`; echo $title
Klimafolgen: China fürchtet dramatischen Rückgang der Reisproduktion - Wissenschaft - SPIEGEL ONLINE - Nachrichten
From: blindcoder
2007-04-24 08:31:53 +0200
I tried it, but with the same result. .* stops matching at 'Ü'.
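
A possible explanation, offered as an assumption because the post does not state the encoding of spiegel.html: if the page is ISO-8859-1, the 'Ü' is a single byte (0xDC) that is not valid UTF-8. In a UTF-8 locale, GNU sed's . does not match bytes that do not form a valid character, so .* would stop at that byte and the unmatched rest of the line would be printed unchanged. Under that assumption, forcing the C locale for the sed call alone might work around it. The helper name extract_title below is hypothetical and not part of the original bot.

```shell
# Hypothetical helper: print the text between <title> and </title> of the
# file given as $1, matching byte-wise regardless of the caller's locale.
extract_title() {
    # Flatten the file to a single line and drop carriage returns,
    # as the original pipeline does.
    tr '\n' ' ' < "$1" | tr -d '\r' |
        # LC_ALL=C makes sed treat every byte as one character, so '.'
        # also matches bytes that are not valid UTF-8 (e.g. Latin-1 0xDC).
        LC_ALL=C sed 's,^.*<title>\(.*\)</title>.*,\1,'
}
```

It could be called as title=$(extract_title spiegel.html). The result keeps the page's original bytes, so a Latin-1 title would still need converting (for example with iconv) before being sent to a UTF-8 channel.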