
Benjamin Schieder

[TECHSUCKS] STRANGE SED BEHAVIOUR

2007 April 23 | 2 comments

I run an IRC bot, powered by ii, that automatically prints the content of the <title> tag of any URL that is posted on its own, without an accompanying explanation.
It does it like this:

wget -o /dev/null -O - "http://www.example.com/" | tr '\n' ' ' | tr -d '\r' > tmp
read title < <( sed 's,^.*<title>\(.*\)</title>.*,\1,g' tmp )


With pages from Spiegel Online this causes problems I can't track down. I have put up an example page on which the sed call gives different results depending on the LANG environment variable:
export LANG=C
read title < <( sed 's,^.*<title>\(.*\)</title>.*,\1,g' spiegel.html )
# title now contains the string between <title> and </title>
export LANG=en_US.utf8
read title < <( sed 's,^.*<title>\(.*\)</title>.*,\1,g' spiegel.html )
# title now contains the string between <title> and </title> PLUS everything after an Ü character

sed --version
GNU sed version 4.1.5


Can someone explain this?


EOF

Category: blog

Tags: TechSucks


2 Comments

From: flipflip
2007-04-24 08:31:11 +0200

I cannot reproduce your problem. But maybe the following works:
title=`wget -qO- http://blog.crash-override.net/img/spiegel.html | sed 's,.*<title>\(.*\)</title>.*,\1,mi'`; echo $title
Klimafolgen: China fürchtet dramatischen Rückgang der Reisproduktion - Wissenschaft - SPIEGEL ONLINE - Nachrichten

From: blindcoder
2007-04-24 08:31:53 +0200

I tried it, but with the same result. .* stops matching at 'Ü'.
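
A possible explanation, offered as a sketch rather than a confirmed diagnosis: in a UTF-8 locale, GNU sed's . does not match a byte sequence that is not valid UTF-8. Assuming spiegel.html is encoded as ISO-8859-1 (an assumption; the post does not state the encoding), the Ü is the single byte 0xDC, which is not valid UTF-8 on its own, so the trailing .* stops there and the rest of the line stays in the output. The snippet below builds a hypothetical one-line sample file containing such a byte and forces the C locale for the sed call only, so that every byte matches:

```shell
# Hypothetical sample: a title followed by a Latin-1 encoded Ü (octal 334 = 0xDC).
sample=$(mktemp)
printf '<html><title>Test title</title><body>\334ber</body></html>\n' > "$sample"

# LC_ALL=C applies to this one sed process only; in the C locale
# every byte is a valid character, so .* can run past the 0xDC byte.
title=$(LC_ALL=C sed 's,^.*<title>\(.*\)</title>.*,\1,' "$sample")

echo "$title"
rm -f "$sample"
```

If the source encoding is known, converting the input to UTF-8 with iconv before it reaches sed would be another option.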
