Perl Notes

Tuesday, August 29, 2006

Regular-expression substitution in vi, sed, and perl

VIM:
:1,5s/pattern/string/gi    (lines 1 to 5)
:%s/pattern/string/gi      (whole file)

sed:
sed -e 's/pattern/string/gi' originalfile > newfile #in FreeBSD: -E

perl:
#overwrite:
perl -pi -e 's/bar/baz/' fileA
perl -pi'*' -e 's/bar/baz/' fileA
#backup to 'fileA.orig'
perl -pi'.orig' -e 's/bar/baz/' fileA
perl -pi'*.orig' -e 's/bar/baz/' fileA
#stdout
perl -pe 's/bar/baz/' fileA
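
The same in-place edit can also be written as a script. This is a minimal sketch, assuming a throwaway file created under a temporary directory (the file name and its contents are made up for the demonstration). Setting $^I is the script-level counterpart of the -i switch.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical sample file, created only for this demonstration.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/fileA";
open my $out, '>', $file or die "cannot write $file: $!";
print {$out} "foo bar\nbar bar\n";
close $out;

{
    # $^I = '.orig' behaves like -i'.orig': the original is kept as
    # fileA.orig and print() writes into the new fileA.
    local ($^I, @ARGV) = ('.orig', $file);
    while (my $line = <>) {
        $line =~ s/bar/baz/;    # no /g: only the first match on each line
        print $line;
    }
}

open my $in, '<', $file or die "cannot read $file: $!";
my $new = do { local $/; <$in> };
close $in;
print $new;
```

Without the g modifier only the first "bar" on each line is replaced, exactly as in the one-liners above.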

Monday, August 28, 2006

url encode and url decode in Perl

url-encode:
$str =~ s/([^A-Za-z0-9])/sprintf("%%%02X", ord($1))/seg;
url-decode:
$str =~ s/\%([A-Fa-f0-9]{2})/pack('C', hex($1))/seg;

One-liners:
  • URL-encode (newlines are left unencoded):
    perl -p -e 's/([^\w\-\.\@])/$1 eq "\n" ? "\n":sprintf("%%%2.2x",ord($1))/eg' keywords.list
  • URL-decode:
    perl -p -e 's/%(..)/pack("C", hex($1))/eg' query.log
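
A quick round trip shows the two substitutions undoing each other. This is a minimal sketch: the helper names url_encode_bytes and url_decode_bytes are made up here, and the input is assumed to be a byte string.

```perl
use strict;
use warnings;

# Percent-encode every byte that is not an ASCII letter or digit.
sub url_encode_bytes {
    my ($str) = @_;
    $str =~ s/([^A-Za-z0-9])/sprintf("%%%02X", ord($1))/eg;
    return $str;
}

# Turn each %XX escape back into the byte it stands for.
sub url_decode_bytes {
    my ($str) = @_;
    $str =~ s/%([A-Fa-f0-9]{2})/pack('C', hex($1))/eg;
    return $str;
}

my $raw     = "a b&c=d/e";
my $encoded = url_encode_bytes($raw);
my $decoded = url_decode_bytes($encoded);

print "$encoded\n";                            # a%20b%26c%3Dd%2Fe
print "round trip ok\n" if $decoded eq $raw;
```

pack('C', ...) treats the value as an unsigned byte, which is why it is preferred over 'c' for values above 127.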

Really advanced perl RegEx reference

* Samples
Swap two fields
s/(\S+)\s+(\S+)/$2 $1/

Match a C identifier
m/[_A-Za-z][_A-Za-z0-9]*/
m/[_[:alpha:]][_[:alnum:]]*/

Empty line
/^$/

A word
\b\w+\b

* Questions

* Reference
perlre (bytes and utf8)
regex.h (regcomp regexec regfree regerror) (single byte only)
java (unicode only)
python (bytes and unicode)

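* Basic structure

* Syntax
m/regex/ismx
s/regex/replacement/ismxg

* Modifiers
i ignore case
s single-line mode (. matches any character, including newline)
m multi-line mode (affects only ^ and $, which then match at the start/end of every line inside the string)
x allow whitespace and comments inside the pattern (Perl-specific)
g global (replace every match)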
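
A small script makes the effect of each modifier visible. It is only a sketch; the sample strings and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";

# i: case-insensitive matching
my $i_match = ("HELLO" =~ /hello/i) ? 1 : 0;      # 1

# s: "." also matches a newline
my $dot_plain = ($text =~ /one.two/)  ? 1 : 0;    # 0
my $dot_s     = ($text =~ /one.two/s) ? 1 : 0;    # 1

# m: "^" matches at the start of every line, not only at the start of the string
my @first_plain = $text =~ /^(\w+)/g;             # ("one")
my @first_m     = $text =~ /^(\w+)/mg;            # ("one", "two")

# x: whitespace and comments inside the pattern are ignored
my ($lo, $hi) = "10 - 20" =~ / (\d+) \s* - \s* (\d+)   # a numeric range
                             /x;

# g: replace every occurrence, not only the first
(my $all = "a-b-c") =~ s/-/+/g;                   # "a+b+c"

print "$i_match $dot_plain $dot_s | @first_plain | @first_m | $lo $hi | $all\n";
```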

* Alternation
m/ABC|XYZ/

* Sequence
m/ABC/

* Repetition
(greedy)
a? 0 or 1
a* 0 or more
a+ 1 or more
a{m} m
a{m,} m or more
a{m,n} m to n (inclusively)

(lazy)
a??
a*?
a+?
a{m}?
a{m,}?
a{m,n}?

Example, matching against "aa":
(a?)(a*)  $1 => "a", $2 => "a"
(a??)(a*) $1 => "",  $2 => "aa"
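
The difference is easy to check in Perl. A minimal sketch (the variable names are arbitrary):

```perl
use strict;
use warnings;

my $text = "aa";

# Greedy a? takes one "a" if it can; lazy a?? takes nothing if the rest still matches.
my @greedy = $text =~ /(a?)(a*)/;
my @lazy   = $text =~ /(a??)(a*)/;

printf "greedy: \$1='%s' \$2='%s'\n", @greedy;    # greedy: $1='a' $2='a'
printf "lazy:   \$1='%s' \$2='%s'\n", @lazy;      # lazy:   $1='' $2='aa'
```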

* Atoms
Character = a b c
Character Class
Escape = \ + non-alpha, such as \\, \+, \(, except reference
Meta Escape= \ + alpha[a-zA-Z]
Groups = (...)

* Character Class
[abc] [a-b] [^abc] [^abc0-9]
A "-" placed first and a "]" placed right after the opening "[" are taken literally
[-a] = - or a
[^\-]

[[]
[]]
[ ]

* Posix Character Class
[[.a.]] collation
[[=a=]] equivalence
[[:alpha:]]

* Meta
. anything except newlines (normal mode)
. anything (s mode, singleline, dotall)
^ start of string, or start of line (m mode)
$ end of string (or just before a final newline), or end of line (m mode)

* Meta Escape
\t \n \r \f \a \e
\0nn \xnn octal and hexadecimal, respectively
\cA (using algorithm ch ^ 0x40)
\cM
\N{name}
\l lowercase next char
\u uppercase next char
\L...\E lowercase until \E
\U...\E uppercase until \E
\Q...\E quote until \E
\w \W word char
\s \S space
\d \D digit
\b \B boundary
\p{property}
\P{property}
\X combining character sequence
\C single byte (perl)
\< \> start and end of word (emacs)

* Groups
(abc) for capture group

* Special group
(?#comment)
(?imsx-imsx) embedded flags
(?:pattern) for non-capture
(?imsx-imsx:pattern) subpattern
(?=pattern) positive look ahead
(?!pattern) negative look ahead
(?<=pattern) positive look behind
(?<!pattern) negative look behind
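
The four look-around forms can be tried on small strings. This is a sketch with made-up sample data. Note the extra |\d in the negative look-ahead: without it the engine would backtrack and accept "10" out of "100USD".

```perl
use strict;
use warnings;

my $prices = "100USD 200EUR";

# Positive look-ahead: digits that are followed by "USD".
my @usd = $prices =~ /(\d+)(?=USD)/g;             # ("100")

# Negative look-ahead: digits NOT followed by "USD" (or by another digit).
my @not_usd = $prices =~ /(\d+)(?!USD|\d)/g;      # ("200")

my $amounts = 'pay $42 or 17';

# Positive look-behind: digits preceded by a dollar sign.
my @dollar = $amounts =~ /(?<=\$)(\d+)/g;         # ("42")

# Negative look-behind: whole numbers not preceded by a dollar sign.
my @plain = $amounts =~ /(?<!\$)\b(\d+)/g;        # ("17")

print "@usd | @not_usd | @dollar | @plain\n";
```

Look-around groups match a position only; they consume no characters and capture nothing.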

* Reference for capture
m/(x)\1/
s/(x)/$1$1/

* Traditional vs extended
\{m,n\} vs {m,n}
\(xxx\) vs (xxx)
Emacs still uses traditional regular expressions

* Special extensions
\< \> start and end of word (emacs)

* Line breaks

\n \v \r \r\n \f \x85 \x{2028} \x{2029} \x1A

sed equivalents of common Linux commands


Command                sed equivalent
basename               sed 's/\(.*\)\/\([^/]*\)/\2/' or sed 's,.*/,,'
cat                    sed '' or sed -n '1,$p' or sed '1,$!d'
cat -s                 sed '/./,/^$/!d'
cat -n                 sed '=' | sed 'N;s/\n/\t/;s/^/ &/' or sed '=' | sed '$!N;s/\n/ /'
cat -E                 sed 's/$/\$/'
cat -t                 sed 's/\t/^I/g'
cut -c n               sed 's/\(.\)\{n\}.*/\1/' or sed 's/^.\{(n-1)\}//g;s/\(.\)\(.*\)/\1/g'
cut -c x-y             sed 's/\(^.\{y\}\)\(.*\)/\1/g;s/^.\{(x-1)\}//'
cut -d'|' -f6          sed 's/\(\([^|]*\)\|\)\{6\}.*/\2/'
cp file1 file2         sed 'w file2' file1
expand -t 1            sed 's/\t/ /g'
dirname                sed 's/\(.*\)\/\([^/]*\)/\1/' or sed 's,[^/]*$,,'
grep pattern           sed -n '/pattern/p' or sed '/pattern/!d'
grep -v pattern        sed -n '/pattern/!p' or sed '/pattern/d'
grep -n pattern        sed -n '/pattern/{=;p}' | sed 'N;s/\n/:/'
head                   sed -n '1,10p'
head -1                sed -n '1p' or sed 'q'
head -Number           sed '1,Number!d' or sed 'Numberq'
paste -s file1 file2   sed ':a;N;s/\n/\t/;ba;' file1 file2 | sed 's/\t\t/\n/'
paste -sdstr           sed ':a;N;s/\n/str/;ba'
rev                    sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//'
tac                    sed -n '1! G;$p;h' or sed -n 'G;$p;h'
tail -1                sed -n '$p' or sed '$!d'
tail -Number           sed ':t;$q;N;(Number+1),$D;bt'
tail -f                sed -u '/./!d'
tr "\n" " "            sed ':a;N;s/\n/ /;ba'
tr "A-Z" "a-z"         sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/'
tr "a-z" "A-Z"         sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/'
tr -d "\012"           sed ':a;N;s/\n//;ba' or sed ':a;N;s/\(^.\)*\n\(.*\)/\2\1/;ba'
tr -s 'x'              sed 's/\(x\)\(x\{1,\}\)/\1/g'
tr -s ' '              sed 's/ \+/ /g'
uniq -u                sed '$b;N;/^\(.*\)\n\1$/ ! {P;D};:c;$d;s/.*\n//;N;/^\(.*\)\n\1$/{bc};D'
uniq                   sed 'N;/^\(.*\)\n\1$/!P;D'
wc -l                  sed -n '$='
wc -c                  sed ':a;s/./&\n/;P;D;/.\{2,\}\n/ba' t|sed -n '$='
wc -w                  sed 's/ /\n/g' | sed -n '$='
xargs                  sed ':a;N;s/\n/ /;ba' or sed -e ':a' -e '$!N;s/\n/ /;ta'

Thursday, August 24, 2006

Please stop using Perl 3 - O'Reilly ONLamp Blog

Please Stop Using Perl 3 - O'Reilly ONLamp Blog

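The author, Curtis Poe, has one clear purpose: to defend Perl. Perl's critics usually raise two complaints: (1) Perl does not scale, and (2) Perl code is easy enough to write but impossible to read.

Poe stresses what novice and hobbyist programmers tend to overlook in development: "separation of concerns, loose coupling and cohesive functions". He also gives examples of how robust Perl-backed applications can be.

Die-hard Python fans can of course keep attacking Perl's readability. Poe's answer is that if you show a programmer an arbitrary piece of PHP code, they cannot tell at a glance whether it is PHP or Perl, yet hardly anyone complains that PHP is hard to read.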

Tuesday, August 08, 2006

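Building CPAN modules with PXPerl 5.8.7-6

Building modules with CPAN + MinGW kept failing. It turned out that PXPerl 5.8.7-6 was released without libperl58.a. A serious oversight, and it cost me two anxious days. The only fix is to download the missing file and save it in the right place (PXPerl\lib\CORE\). That is all there is to it.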

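Making Encode::HanExtra work -- chaoslawful's notes

Making Encode::HanExtra work -- chaoslawful's notes: "Supporting the GB18030 encoding requires installing the Encode::HanExtra module from CPAN, but the default Encode configuration shipped with the Perl distribution does not enable HanExtra, so GB18030 is still unusable after the module is installed. To make the module take effect:
Edit lib/Encode/Config.pm in the Perl distribution. Its %ExtModule hash maps each encoding to the name of the module that implements it. A search shows that the lines for several encodings, gb18030 among them, are commented out. After installing Encode::HanExtra, uncomment those lines by hand."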
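
Whether the change took effect can be checked from Perl itself. This is a minimal sketch, assuming nothing about the local installation; it only asks Encode whether it can resolve the name and prints one message or the other.

```perl
use strict;
use warnings;
use Encode ();

# find_encoding returns an encoding object, or undef when Encode
# cannot resolve the name (for example, when no module provides it).
my $enc = Encode::find_encoding('gb18030');

if ($enc) {
    print "gb18030 is available as ", $enc->name, "\n";
}
else {
    print "gb18030 is not available in this Perl installation\n";
}
```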

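blogspot's IP really is 66.102.15.101!

You may be able to reach the blogs on blogspot.com again today, for the moment. Don't celebrate too soon, though: this may just be an illusion caused by a temporary IP change that the GFW has been slow to catch up with.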

A photo test

Monday, August 07, 2006

Excerpts

 Perl with UTF-8 mode
Parsing a Querystring With Perl - A Simple ISINDEX Query (Page 2 of 3 )
What’s in a User-Agent String?

Have fun with google: detect encoding of a meaningful string

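Mozilla uses a probabilistic mechanism to keep encoding detection accurate, especially for large character sets such as Chinese, Japanese, and Korean. The principle is simple: take frequency-ranked tables of the characters in each character set, compare them with the character frequencies in the target text, and guess which encoding the text belongs to.

Google, for its part, has the largest publicly accessible database of web pages in the world and has built the most specialized text retrieval and ranking system on top of it. Searching Google for a short string (for example the ID3 tags of Chinese, Japanese, and Korean MP3s, which get mixed up for all sorts of reasons) may make it possible to infer the string's character set from the number of matching records.

ID3 strings are usually short. If they contain many rare characters, identification based on individual characters alone runs into trouble. A probability model over character combinations is more complex to implement locally, needs more storage, and cannot follow current usage, which makes upgrades a burden. Google's database is updated continuously and its results reflect real usage. Taken together, these factors leave room for a method that decides by the total number of Google hits.

Implementation:
  1. Get the MP3 ID3 tag string
  2. Decode the string as utf8, cp936, big5, euc-jp, and euc-kr in turn (giving one UTF-8 test string per candidate encoding)
  3. Submit each test string to google.com as a search
  4. Read the total hit count for each and take the largest
Todo:
  • Test the ability to read the ID3 tag (v1/v2) correctly
  • Test that the request URL is generated correctly
  • Validate the detection results for the different encodings on a large scale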
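
Step 2 can be sketched with the core Encode module. This is only an illustration under stated assumptions: the helper name candidate_decodings and the sample tag are made up, and steps 3 and 4 (querying Google and comparing hit counts) are deliberately left out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Candidate encodings from step 2.
my @candidates = qw(UTF-8 cp936 big5 euc-jp euc-kr);

# Returns a hash ref: encoding name => decoded text, for every candidate
# under which the bytes are well-formed.
sub candidate_decodings {
    my ($bytes) = @_;
    my %result;
    for my $name (@candidates) {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps $bytes intact for the next candidate.
        my $text = eval {
            decode($name, $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        };
        $result{$name} = $text if defined $text;
    }
    return \%result;
}

# Hypothetical tag: two Chinese characters (U+4E2D U+6587) stored as cp936 bytes.
my $tag       = encode('cp936', "\x{4E2D}\x{6587}");
my $decodings = candidate_decodings($tag);

for my $name (sort keys %$decodings) {
    printf "%-6s -> %s\n", $name, encode('UTF-8', $decodings->{$name});
}
```

Several candidates usually decode without error, which is exactly why a second signal such as the hit count is needed to pick one.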

Emitting UTF-8 characters in URL format

sub url_encode {
    # default argument is $_
    local $_ = @_ ? shift : $_;
    defined or return;

    # change unsafe characters (except for space) to encoded value
    s/([^\w ()'*~!.-])/sprintf '%%%02x', ord($1)/eg;

    # change spaces to +
    tr/ /+/;

    return $_;
}
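
When the input is a decoded character string rather than raw bytes, it has to be turned into UTF-8 bytes before percent-encoding. A minimal sketch of that variant follows; the name url_encode_utf8 and the sample text are made up, and the safe-character set is spelled out in ASCII so that every high byte gets escaped.

```perl
use strict;
use warnings;
use Encode qw(encode);

sub url_encode_utf8 {
    my ($text) = @_;
    return unless defined $text;

    # Characters -> UTF-8 bytes, so that ord() below never exceeds 255.
    my $bytes = encode('UTF-8', $text);

    # Escape everything except ASCII letters, digits, space, and a few safe marks.
    $bytes =~ s/([^A-Za-z0-9 ()'*~!._-])/sprintf('%%%02x', ord($1))/eg;

    # Spaces become "+", as in the routine above.
    $bytes =~ tr/ /+/;

    return $bytes;
}

print url_encode_utf8("\x{4E2D}\x{6587} mp3"), "\n";   # %e4%b8%ad%e6%96%87+mp3
```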



Installing LWP on OS X

Open Terminal and run:

perl -MCPAN -e shell

Then:
install HTTP::Tagset

Finally:
install LWP