[TECHSUCKS] STRANGE SED BEHAVIOUR
2007 April 23
I have an IRC bot, powered by ii, that automatically prints the content of the <title> tag of any URL that is posted on its own, without an explanation.
It does so like this:
wget -o /dev/null -O - "http://www.example.com/" | tr '\n' ' ' | tr -d $'\r' > tmp
read title < <( sed 's,^.*<title>\(.*\)</title>.*,\1,g' tmp )
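The same steps can be sketched as a small function that needs neither the temporary file nor the process substitution. The name extract_title is hypothetical and not part of the bot; unlike the command above, sed -n with the p flag prints nothing when no <title> is found, instead of passing the whole page through.

```shell
# Hypothetical helper (not the bot's actual code): reads HTML on stdin and
# prints the text between <title> and </title>, or nothing if there is none.
extract_title() {
    # Join all lines into one and drop carriage returns, so that a title
    # spanning several lines still fits into a single sed pattern space.
    tr '\n' ' ' | tr -d '\r' |
        sed -n 's,^.*<title>\(.*\)</title>.*,\1,p'
}

# Small demonstration on an inline sample instead of a live download.
printf '<html><head>\r\n<title>Hello World</title>\r\n</head></html>\n' | extract_title
```

With a live page this would be used as: title=$(wget -o /dev/null -O - "http://www.example.com/" | extract_title)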
With pages from Spiegel Online this causes problems I can't trace. Here is an example page for which the sed call gives different results depending on the LANG environment variable:
export LANG=C
read title < <( sed 's,^.*<title>\(.*\)</title>.*,\1,g' spiegel.html )
# title now contains the string between <title> and </title>
export LANG=en_US.utf8
read title < <( sed 's,^.*<title>\(.*\)</title>.*,\1,g' spiegel.html )
# title now contains the string between <title> and </title> PLUS everything after an Ü character
sed --version
GNU sed version 4.1.5
Can someone explain this?
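One possible explanation, not verified against the original page: in a UTF-8 locale, GNU sed's "." does not match bytes that are not valid UTF-8. If the page is served as ISO-8859-1 (an assumption), the Ü is the single byte 0xDC, so the trailing ".*" stops right before it and everything from there on is left in the output. Under that assumption, forcing the C locale for just the sed call is a workaround. The sketch below uses a made-up stand-in file, sample.html, rather than the real spiegel.html.

```shell
# Stand-in for spiegel.html (hypothetical file name and content): the byte
# \334 is "Ü" in ISO-8859-1 and is not a valid UTF-8 sequence on its own.
printf '<html><title>Test \334ber</title><body>rest</body></html>\n' > sample.html

# LC_ALL=C applies to this one sed call only, so "." matches every byte
# no matter what LANG is set to in the surrounding shell.
title=$(LC_ALL=C sed 's,^.*<title>\(.*\)</title>.*,\1,g' sample.html)

# The result is still in the page's original encoding, not in UTF-8.
printf '%s\n' "$title"
```

The title comes out in the page's own encoding, so it may still need a pass through iconv (for example iconv -f ISO-8859-1 -t UTF-8) before being sent to a UTF-8 channel.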
EOF
Category: blog
Tags: TechSucks