A Simple Example


Matching in Scalar Context

pl@nereida:~/Lperltesting$ cat -n c2f.pl
    1   #!/usr/bin/perl -w
    2   use strict;
    3 
    4   print "Enter a temperature (i.e. 32F, 100C):\n";
    5   my $input = <STDIN>;
    6   chomp($input);
    7 
    8   if ($input !~ m/^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/i) {
    9     warn "Expecting a temperature, so don't understand \"$input\".\n";
   10   }
   11   else {
   12     my $InputNum = $1;
   13     my $type = $3;
   14     my ($celsius, $farenheit);
   15     if ($type eq "C" or $type eq "c") {
   16       $celsius = $InputNum;
   17       $farenheit = ($celsius * 9/5)+32;
   18     }
   19     else {
   20       $farenheit = $InputNum;
   21       $celsius = ($farenheit -32)*5/9;
   22     }
   23     printf "%.2f C = %.2f F\n", $celsius, $farenheit;
   24   }

See also:

perldoc perlrequick
perldoc perlretut
perldoc perlre
perldoc perlreref

Running it under the debugger:

pl@nereida:~/Lperltesting$  perl -wd c2f.pl
Loading DB routines from perl5db.pl version 1.28
Editor support available.
Enter h or `h h' for help, or `man perldebug' for more help.
main::(c2f.pl:4):       print "Enter a temperature (i.e. 32F, 100C):\n";
DB<1>  c 8
Enter a temperature (i.e. 32F, 100C):
32F
main::(c2f.pl:8):       if ($input !~ m/^([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$/i) {
DB<2>  n
main::(c2f.pl:12):        my $InputNum = $1;
DB<2>  x ($1, $2, $3)
0  32
1  undef
2  'F'
DB<3>  use YAPE::Regex::Explain
DB<4>  p YAPE::Regex::Explain->new('([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$')->explain
The regular expression:
(?-imsx:([-+]?[0-9]+(\.[0-9]*)?)\s*([CF])$)
matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [-+]?                    any character of: '-', '+' (optional
                             (matching the most amount possible))
----------------------------------------------------------------------
    [0-9]+                   any character of: '0' to '9' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    (                        group and capture to \2 (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      \.                       '.'
----------------------------------------------------------------------
      [0-9]*                   any character of: '0' to '9' (0 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
    )?                       end of \2 (NOTE: because you're using a
                             quantifier on this capture, only the
                             LAST repetition of the captured pattern
                             will be stored in \2)
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    [CF]                     any character of: 'C', 'F'
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

Inside a regular expression, the text matched by the first, second, etc. pair of parentheses must be referred to as \1, \2, etc. The notation $1 refers to what the first pair of parentheses matched in the last successful match, not in the current one. Let us look at an example:

pl@nereida:~/Lperltesting$ cat -n dollar1slash1.pl
    1   #!/usr/bin/perl -w
    2   use strict;
    3 
    4   my $a = "hola juanito";
    5   my $b = "adios anita";
    6 
    7   $a =~ /(ani)/;
    8   $b =~ s/(adios) *($1)/\U$1 $2/;
    9   print "$b\n";

Notice how the $1 that appears in the replacement string (line 8) refers to the string adios, whereas the $1 in the pattern part contains ani:

pl@nereida:~/Lperltesting$ ./dollar1slash1.pl
ADIOS ANIta

Exercise 3.1.1 State what the output of the previous program is if line 8 is replaced by

$b =~ s/(adios) *(\1)/\U$1 $2/;

Number of Parentheses

The number of capturing parentheses is not limited:

pl@nereida:~/Lperltesting$ perl -wde 0
main::(-e:1):   0
            123456789ABCDEF
DB<1> $x = "123456789AAAAAA"
                   1  2  3  4  5  6  7  8  9 10 11  12
DB<2> $r = $x =~ /(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)\11/; print "$r\n$10\n$11\n"
1
A
A

See the following paragraph from perlre (section 'Capture buffers'):

There is no limit to the number of captured substrings that you may use. However Perl also uses \10, \11, etc. as aliases for \010, \011, etc. (Recall that 0 means octal, so \011 is the character at number 9 in your coded character set; which would be the 10th character, a horizontal tab under ASCII.) Perl resolves this ambiguity by interpreting \10 as a backreference only if at least 10 left parentheses have opened before it. Likewise \11 is a backreference only if at least 11 left parentheses have opened before it. And so on. \1 through \9 are always interpreted as backreferences.
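As a small illustration of the ambiguity described in the quoted paragraph (a sketch; the variable names are ours, not taken from perlre): with only one group open, \11 is read as the octal escape for a tab, whereas after eleven groups it is a backreference. The \g{11} form, available since Perl 5.10, is always a backreference.

```perl
use strict;
use warnings;

# Only one group precedes \11, so it is the octal escape \011 (a tab).
my $octal = ("a\t" =~ /(a)\11/) ? 1 : 0;

# Eleven groups precede \11, so it is a backreference to group 11.
my $x = "123456789AAAAAA";
my $backref = ($x =~ /(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)\11/) ? 1 : 0;

# \g{11} is always read as a backreference, never as an octal escape.
my $explicit = ($x =~ /(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)\g{11}/) ? 1 : 0;

print "octal=$octal backref=$backref explicit=$explicit\n";
```

When a pattern has many groups, writing \g{N} avoids having to count parentheses to know which reading Perl will choose.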

List Context

When used in a context that requires a list (and without the /g modifier), the pattern match returns a list consisting of the subexpressions matched by the parentheses, that is, $1, $2, $3, .... If there is no match, the empty list is returned. If there is a match but the pattern has no parentheses, the list (1) is returned.

pl@nereida:~/src/perl/perltesting$ cat -n escapes.pl
     1  #!/usr/bin/perl -w
     2  use strict;
     3
     4  my $foo = "one two three four five\nsix seven";
     5  my ($F1, $F2, $Etc) = ($foo =~ /^\s*(\S+)\s+(\S+)\s*(.*)/);
     6  print "List Context: F1 = $F1, F2 = $F2, Etc = $Etc\n";
     7
     8  # This is 'almost' the same than:
     9  ($F1, $F2, $Etc) = split(/\s+/, $foo, 3);
    10  print "Split: F1 = $F1, F2 = $F2, Etc = $Etc\n";

Look at the result of the run:

pl@nereida:~/src/perl/perltesting$ ./escapes.pl
List Context: F1 = one, F2 = two, Etc = three four five
Split: F1 = one, F2 = two, Etc = three four five
six seven
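The three possible outcomes of a match in list context can be checked directly. The following is a minimal sketch with made-up data; per perlop, a successful match whose pattern has no capturing groups returns the list (1).

```perl
use strict;
use warnings;

my $s = "one two";

my @groups  = $s =~ /(\w+)\s+(\w+)/;  # one element per capturing group
my @failed  = $s =~ /(\d+)/;          # no match: the empty list
my @no_caps = $s =~ /\w+/;            # match without groups: the list (1)

print "groups:  @groups\n";
print "failed:  ", scalar(@failed), " elements\n";
print "no_caps: @no_caps\n";
```

Because a failed match yields the empty list, a list assignment such as `my ($a, $b) = $s =~ /.../` can be used directly as the condition of an if.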

The `s` modifier

The s option in a regexp makes the dot '.' match a newline as well:

pl@nereida:~/src/perl/perltesting$  perl -wd ./escapes.pl
main::(./escapes.pl:4): my $foo = "one two three four five\nsix seven";
DB<1>  c 9
List Context: F1 = one, F2 = two, Etc = three four five
main::(./escapes.pl:9): ($F1, $F2, $Etc) = split(' ',$foo, 3);
DB<2>  ($F1, $F2, $Etc) = ($foo =~ /^\s*(\S+)\s+(\S+)\s*(.*)/s)
DB<3>  p "List Context: F1 = $F1, F2 = $F2, Etc = $Etc\n"
List Context: F1 = one, F2 = two, Etc = three four five
six seven

The /s option makes . match \n too; that is, it matches any character whatsoever.

Let us look at another example, which prints the names of the files containing strings that match a given pattern, even if the match is spread over several lines:

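    #!/usr/bin/perl -w
    # usage: smodifier.pl 'expr' files
    # prints the names of the files whose contents match the given expr
    use strict;

    undef $/;    # no input record separator: read each file in one go
    my $what = shift @ARGV;
    while (my $file = shift @ARGV) {
      open(my $fh, '<', $file) or do { warn "can't open $file: $!\n"; next };
      my $contents = <$fh>;    # the whole file as a single string
      close($fh);
      # /s lets '.' match newlines, so the pattern may span several lines
      if (defined $contents and $contents =~ /$what/s) {
        print "$file\n";
      }
    }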

Usage example:

> smodifier.pl 'three.*three' double.in split.pl doublee.pl
double.in
doublee.pl

See section 3.4.2 for the contents of the file double.in. In that file, the pattern three.*three is spread across several lines.

The `m` modifier

The s modifier is often used together with the m modifier. Here is what the section 'Using character classes' of perlretut says about it:

m modifier (//m): Treat string as a set of multiple lines. '.' matches any character except \n. ^ and $ are able to match at the start or end of any line within the string.
both s and m modifiers (//sm): Treat string as a single long line, but detect multiple lines. '.' matches any character, even \n . ^ and $ , however, are able to match at the start or end of any line within the string.

Here are examples of //s and //m in action:

    $x = "There once was a girl\nWho programmed in Perl\n";

    $x =~ /^Who/;   # doesn't match, "Who" not at start of string
    $x =~ /^Who/s;  # doesn't match, "Who" not at start of string
    $x =~ /^Who/m;  # matches, "Who" at start of second line
    $x =~ /^Who/sm; # matches, "Who" at start of second line

    $x =~ /girl.Who/;   # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/s;  # matches, "." matches "\n"
    $x =~ /girl.Who/m;  # doesn't match, "." doesn't match "\n"
    $x =~ /girl.Who/sm; # matches, "." matches "\n"

Most of the time, the default behavior is what is wanted, but //s and //m are occasionally very useful. If //m is being used, the start of the string can still be matched with \A and the end of the string can still be matched with the anchors \Z (matches both the end and the newline before, like $), and \z (matches only the end):

    $x =~ /^Who/m;   # matches, "Who" at start of second line
    $x =~ /\AWho/m;  # doesn't match, "Who" is not at start of string

    $x =~ /girl$/m;  # matches, "girl" at end of first line
    $x =~ /girl\Z/m; # doesn't match, "girl" is not at end of string

    $x =~ /Perl\Z/m; # matches, "Perl" is at newline before end
    $x =~ /Perl\z/m; # doesn't match, "Perl" is not at end of string

Normally the character ^ matches only at the beginning of the string and the character $ only at the end. Embedded \n characters are not matched by ^ or $. The /m modifier changes this behavior, so that ^ and $ match at any internal line boundary. The anchors \A and \Z are then used to match the beginning and end of the string. Here is an example:

nereida:~/perl/src> perl -de 0
  DB<1> $a = "hola\npedro"
  DB<2> p "$a"
hola
pedro
  DB<3> $a =~ s/.*/x/m
  DB<4> p $a
x
pedro
  DB<5> $a =~ s/^pedro$/juan/
  DB<6> p "$a"
x
pedro
  DB<7> $a =~ s/^pedro$/juan/m
  DB<8>  p "$a"
x
juan

The temperature converter rewritten using list context

Let us rewrite the temperature converter using list context:

casiano@millo:~/Lperltesting$ cat -n c2f_list.pl
    1   #!/usr/bin/perl -w
    2   use strict;
    3 
    4   print "Enter a temperature (i.e. 32F, 100C):\n";
    5   my $input = <STDIN>;
    6   chomp($input);
    7 
    8   my ($InputNum, $type);
    9 
   10   ($InputNum, $type) = $input =~ m/^
   11                                       ([-+]?[0-9]+(?:\.[0-9]*)?) # Temperature
   12                                       \s*
   13                                       ([cCfF]) # Celsius or Fahrenheit
   14                                    $/x;
   15 
   16   die "Expecting a temperature, so don't understand \"$input\".\n" unless defined($InputNum);
   17 
   18   my ($celsius, $fahrenheit);
   19   if ($type eq "C" or $type eq "c") {
   20     $celsius = $InputNum;
   21     $fahrenheit = ($celsius * 9/5)+32;
   22   }
   23   else {
   24     $fahrenheit = $InputNum;
   25     $celsius = ($fahrenheit -32)*5/9;
   26   }
   27   printf "%.2f C = %.2f F\n", $celsius, $fahrenheit;

The `x` option

The /x option in a regexp allows comments and whitespace inside the regular expression. Whitespace inside the pattern stops being significant. If you need a significant space, use \s (which matches any whitespace character) or escape the space. See the section 'Modifiers' in perlre and the section 'Building a regexp' in perlretut.
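A minimal sketch of how whitespace behaves under /x (the sample string and variable names are ours): an unescaped blank is dropped from the pattern, while an escaped blank or a blank inside a bracketed character class stays significant.

```perl
use strict;
use warnings;

my $str = "hello world";

# Under /x the blank between the words is ignored: the pattern is "helloworld".
my $plain   = ($str =~ /hello world/x)     ? 1 : 0;

# An escaped blank, or a blank inside a character class, is kept.
my $escaped = ($str =~ /hello \  world/x)  ? 1 : 0;
my $class   = ($str =~ /hello [ ] world/x) ? 1 : 0;

print "plain=$plain escaped=$escaped class=$class\n";
```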

Non-capturing parentheses

The notation (?: ... ) introduces grouping parentheses without memory. (?: ...) groups subexpressions just as ordinary parentheses do. The difference is that it does not "remember" anything: nothing is stored in $1, $2, etc. This results in a somewhat more efficient regexp. Let us look at an example:

> cat groupingpar.pl
#!/usr/bin/perl

  my $a = shift;

  $a =~ m/(?:hola )*(juan)/;
  print "$1\n";
nereida:~/perl/src> groupingpar.pl 'hola juan'
juan
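To make the difference in group numbering explicit, here is a small sketch comparing both kinds of parentheses on the same input (the variable names are ours):

```perl
use strict;
use warnings;

# With ordinary parentheses, "juan" ends up in the second group.
my @capturing     = "hola juan" =~ /(hola )*(juan)/;

# With (?: ... ) the first group stores nothing, so "juan" is group 1.
my @non_capturing = "hola juan" =~ /(?:hola )*(juan)/;

print scalar(@capturing), " captures: @capturing\n";
print scalar(@non_capturing), " capture: @non_capturing\n";
```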

Interpolation in patterns: the `o` option

A regular expression may contain variables, which are interpolated (in which case the pattern may be recompiled each time it is evaluated). If you want such a pattern to be compiled only once, use the /o option.

pl@nereida:~/Lperltesting$ cat -n mygrep.pl
     1  #!/usr/bin/perl -w
     2  my $what = shift @ARGV || die "Usage $0 regexp files ...\n";
     3  while (<>) {
     4    print "File $ARGV, rel. line $.: $_" if (/$what/o); # compile only once
     5  }
     6

An example run follows:

pl@nereida:~/Lperltesting$ ./mygrep.pl
Usage ./mygrep.pl regexp files ...
pl@nereida:~/Lperltesting$ ./mygrep.pl if labels.c
File labels.c, rel. line 7:        if (a < 10) goto LABEL;

The following text is from the section 'Using regular expressions in Perl' of perlretut:

If $pattern won't be changing over the lifetime of the script, we can add the //o modifier, which directs Perl to only perform variable substitutions once

Another option is to precompile the pattern with the qr operator (see the section 'Regexp Quote-Like Operators' in perlop). The following variant of the previous program also compiles the pattern only once:

pl@nereida:~/Lperltesting$ cat -n mygrep2.pl
     1  #!/usr/bin/perl -w
     2  my $what = shift @ARGV || die "Usage $0 regexp files ...\n";
     3  $what = qr{$what};
     4  while (<>) {
     5    print "File $ARGV, rel. line $.: $_" if (/$what/);
     6  }

See also:

The PerlMonks node "/o is dead, long live qr//!" by diotalevi
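Besides compiling once, a qr object can be used as a building block for larger patterns. The sketch below is illustrative only: it reuses the number pattern from the temperature converter, and the names $number, $temperature and @results are ours.

```perl
use strict;
use warnings;

my $number      = qr/[-+]?[0-9]+(?:\.[0-9]*)?/;  # compiled once
my $temperature = qr/^($number)\s*([CF])$/i;     # built from the piece above

my @results;
for my $input ("32F", "-40 c", "warm") {
    # In list context the match returns the two captures, or () on failure.
    if (my ($value, $scale) = $input =~ $temperature) {
        push @results, "$value " . uc($scale);
    }
    else {
        push @results, "no match";
    }
}
print join(", ", @results), "\n";
```

Note that the inner pattern keeps its own modifiers when interpolated; here that does not matter because $number contains no letters.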

Greedy quantifiers

The following excerpt from the section 'Matching repetitions' of perlretut illustrates the greedy semantics of the repetition operators *, +, ?, {} and so on.

For all of these quantifiers, Perl will try to match as much of the string as possible, while still allowing the regexp to succeed. Thus with /a?.../, Perl will first try to match the regexp with the a present; if that fails, Perl will try to match the regexp without the a present. For the quantifier * , we get the following:

    $x = "the cat in the hat";
    $x =~ /^(.*)(cat)(.*)$/; # matches,
                             # $1 = 'the '
                             # $2 = 'cat'
                             # $3 = ' in the hat'

Which is what we might expect, the match finds the only cat in the string and locks onto it. Consider, however, this regexp:

    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = '' (0 characters match)

One might initially guess that Perl would find the at in cat and stop there, but that wouldn't give the longest possible string to the first quantifier .*. Instead, the first quantifier .* grabs as much of the string as possible while still having the regexp match. In this example, that means having the at sequence with the final at in the string.

The other important principle illustrated here is that when there are two or more elements in a regexp, the leftmost quantifier, if there is one, gets to grab as much the string as possible, leaving the rest of the regexp to fight over scraps. Thus in our example, the first quantifier .* grabs most of the string, while the second quantifier .* gets the empty string. Quantifiers that grab as much of the string as possible are called maximal match or greedy quantifiers.

When a regexp can match a string in several different ways, we can use the principles above to predict which way the regexp will match:

Principle 0: Taken as a whole, any regexp will be matched at the earliest possible position in the string.
Principle 1: In an alternation a|b|c... , the leftmost alternative that allows a match for the whole regexp will be the one used.
Principle 2: The maximal matching quantifiers ?, *, + and {n,m} will in general match as much of the string as possible while still allowing the whole regexp to match.
Principle 3: If there are two or more elements in a regexp, the leftmost greedy quantifier, if any, will match as much of the string as possible while still allowing the whole regexp to match. The next leftmost greedy quantifier, if any, will try to match as much of the string remaining available to it as possible, while still allowing the whole regexp to match. And so on, until all the regexp elements are satisfied.
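The principles can be observed on small inputs. The following sketch uses strings of our own choosing, except for the last match, which repeats the perlretut example quoted above.

```perl
use strict;
use warnings;

# Principle 0: the earliest starting position wins, so 'cat' (which starts
# one character before the first 'at') is found even though 'at' is the
# first alternative.
my ($p0) = "the cat in the hat" =~ /(at|cat)/;

# Principle 1: at a given position, the leftmost alternative that works wins.
my ($p1) = "cats" =~ /(c|ca|cat)/;

# Principles 2 and 3: the leftmost greedy quantifier takes as much as it can.
my @p3 = "the cat in the hat" =~ /^(.*)(at)(.*)$/;

print "p0=<$p0> p1=<$p1> p3=<$p3[0]><$p3[1]><$p3[2]>\n";
```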

Regexps and Infinite Loops

The following passage is taken from the section 'Repeated Patterns Matching a Zero-length Substring' of perlre:

Regular expressions provide a terse and powerful programming language. As with most other power tools, power comes together with the ability to wreak havoc.

A common abuse of this power stems from the ability to make infinite loops using regular expressions, with something as innocuous as:

    'foo' =~ m{ ( o? )* }x;

The o? matches at the beginning of 'foo' , and since the position in the string is not moved by the match, o? would match again and again because of the * quantifier.

Another common way to create a similar cycle is with the looping modifier //g :

    @matches = ( 'foo' =~ m{ o? }xg );

or

    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;

or the loop implied by split().

... Perl allows such constructs, by forcefully breaking the infinite loop. The rules for this are different for lower-level loops given by the greedy quantifiers *+{} , and for higher-level ones like the /g modifier or split() operator.

The lower-level loops are interrupted (that is, the loop is broken) when Perl detects that a repeated expression matched a zero-length substring. Thus

    m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;

is made equivalent to

    m{ (?: NON_ZERO_LENGTH )*
    |
    (?: ZERO_LENGTH )?
    }x;

The higher level-loops preserve an additional state between iterations: whether the last match was zero-length. To break the loop, the following match after a zero-length match is prohibited to have a length of zero. This prohibition interacts with backtracking (see Backtracking), and so the second best match is chosen if the best match is of zero length.

For example:

    $_ = 'bar';
    s/\w??/<$&>/g;

results in <><b><><a><><r><> . At each position of the string the best match given by non-greedy ?? is the zero-length match, and the second best match is what is matched by \w . Thus zero-length matches alternate with one-character-long matches.

Similarly, for repeated m/()/g the second-best match is the match at the position one notch further in the string.

The additional state of being matched with zero-length is associated with the matched string, and is reset by each assignment to pos(). Zero-length matches at the end of the previous match are ignored during split.
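The loop-breaking rule for /g can be watched in action with the two patterns quoted above. This is a sketch; the variable names are ours.

```perl
use strict;
use warnings;

# Every match of m{ o? }xg against 'foo', in list context.
# The empty match at the start is followed by the two 'o's and a final
# empty match at the end of the string; then the loop stops.
my @matches = ('foo' =~ m{ o? }xg);

# The substitution from the perlre excerpt: zero-length matches alternate
# with one-character matches.
(my $bar = 'bar') =~ s/\w??/<$&>/g;

print scalar(@matches), " matches: ", join('|', map { "<$_>" } @matches), "\n";
print "$bar\n";
```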

Exercise 3.1.2

Explain the behavior of the following match:

  DB<25>  $c = 0

  DB<26>   print(($c++).": <$&>\n") while 'aaaabababab' =~ /a*(ab)*/g;
0: <aaaa>
1: <>
2: <a>
3: <>
4: <a>
5: <>
6: <a>
7: <>
8: <>

Lazy quantifiers

Lazy (non-greedy) quantifiers make the NFA stop at the shortest string that matches the expression. They are written like their greedy counterparts with the suffix ? appended:

{n,m}?
{n,}?
{n}?
*?
+?
??

Let us review what the section 'Matching repetitions' of perlretut says:

Sometimes greed is not good. At times, we would like quantifiers to match a minimal piece of string, rather than a maximal piece. For this purpose, Larry Wall created the minimal match or non-greedy quantifiers ?? ,*?, +?, and {}?. These are the usual quantifiers with a ? appended to them. They have the following meanings:

a?? means: match 'a' 0 or 1 times. Try 0 first, then 1.
a*? means: match 'a' 0 or more times, i.e., any number of times, but as few times as possible
a+? means: match 'a' 1 or more times, i.e., at least once, but as few times as possible
a{n,m}? means: match at least n times, not more than m times, as few times as possible
a{n,}? means: match at least n times, but as few times as possible
a{n}? means: match exactly n times. Because we match exactly n times, a{n}? is equivalent to a{n} and is just there for notational consistency.

Let's look at the example above, but with minimal quantifiers:

    $x = "The programming republic of Perl";
    $x =~ /^(.+?)(e|r)(.*)$/; # matches,
                              # $1 = 'Th'
                              # $2 = 'e'
                              # $3 = ' programming republic of Perl'

The minimal string that will allow both the start of the string ^ and the alternation to match is Th , with the alternation e|r matching e. The second quantifier .* is free to gobble up the rest of the string.

    $x =~ /(m{1,2}?)(.*?)$/; # matches,
                             # $1 = 'm'
                             # $2 = 'ming republic of Perl'

The first string position that this regexp can match is at the first m in programming . At this position, the minimal m{1,2}? matches just one m . Although the second quantifier .*? would prefer to match no characters, it is constrained by the end-of-string anchor $ to match the rest of the string.

    $x =~ /(.*?)(m{1,2}?)(.*)$/; # matches,
                                 # $1 = 'The progra'
                                 # $2 = 'm'
                                 # $3 = 'ming republic of Perl'

In this regexp, you might expect the first minimal quantifier .*? to match the empty string, because it is not constrained by a ^ anchor to match the beginning of the word. Principle 0 applies here, however. Because it is possible for the whole regexp to match at the start of the string, it will match at the start of the string. Thus the first quantifier has to match everything up to the first m. The second minimal quantifier matches just one m and the third quantifier matches the rest of the string.

    $x =~ /(.??)(m{1,2})(.*)$/; # matches,
                                # $1 = 'a'
                                # $2 = 'mm'
                                # $3 = 'ing republic of Perl'

Just as in the previous regexp, the first quantifier .?? can match earliest at position a , so it does. The second quantifier is greedy, so it matches mm , and the third matches the rest of the string.

We can modify principle 3 above to take into account non-greedy quantifiers:

Principle 3: If there are two or more elements in a regexp, the leftmost greedy (non-greedy) quantifier, if any, will match as much (little) of the string as possible while still allowing the whole regexp to match. The next leftmost greedy (non-greedy) quantifier, if any, will try to match as much (little) of the string remaining available to it as possible, while still allowing the whole regexp to match. And so on, until all the regexp elements are satisfied.
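A compact way to compare both kinds of quantifier is to run them on the same string. The sketch below uses an input of our own and also shows the negated character class that is often used in place of a lazy quantifier.

```perl
use strict;
use warnings;

my $html = "<b>bold</b>";

my ($greedy)  = $html =~ /<(.+)>/;      # .+ runs to the last '>'
my ($lazy)    = $html =~ /<(.+?)>/;     # .+? stops at the first '>'
my ($negated) = $html =~ /<([^>]+)>/;   # cannot cross a '>' at all

print "greedy=<$greedy> lazy=<$lazy> negated=<$negated>\n";
```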

Exercise 3.1.3 Explain what the result of the second matching command entered in the debugger will be:

casiano@millo:~/Lperltesting$ perl -wde 0
main::(-e:1):   0
  DB<1> x ('1'x34) =~ m{^(11+)\1+$}
0  11111111111111111
  DB<2> x ('1'x34) =~ m{^(11+?)\1+$}
????????????????????????????????????

Detailed description of the matching process

Let us look in detail at what happens during a match. Let us review what the section 'Matching repetitions' of perlretut says:

Just like alternation, quantifiers are also susceptible to backtracking. Here is a step-by-step analysis of the example

    $x = "the cat in the hat";
    $x =~ /^(.*)(at)(.*)$/; # matches,
                            # $1 = 'the cat in the h'
                            # $2 = 'at'
                            # $3 = '' (0 matches)

Start with the first letter in the string 't'.
The first quantifier '.*' starts out by matching the whole string 'the cat in the hat'.
'a' in the regexp element 'at' doesn't match the end of the string. Backtrack one character.
'a' in the regexp element 'at' still doesn't match the last letter of the string 't', so backtrack one more character.
Now we can match the 'a' and the 't'.
Move on to the third element '.*'. Since we are at the end of the string and '.*' can match 0 times, assign it the empty string.
We are done!

Performance

The way a regexp is written can lead to large variations in performance. Let us review what the section 'Matching repetitions' of perlretut says:

Most of the time, all this moving forward and backtracking happens quickly and searching is fast. There are some pathological regexps, however, whose execution time exponentially grows with the size of the string. A typical structure that blows up in your face is of the form

            /(a|b+)*/;

The problem is the nested indeterminate quantifiers. There are many different ways of partitioning a string of length n between the + and *: one repetition with b+ of length n, two repetitions with the first b+ length k and the second with length n-k, m repetitions whose bits add up to length n, etc.

In fact there are an exponential number of ways to partition a string as a function of its length. A regexp may get lucky and match early in the process, but if there is no match, Perl will try every possibility before giving up. So be careful with nested *'s, {n,m}'s, and + 's.

The book Mastering Regular Expressions by Jeffrey Friedl [3] gives a wonderful discussion of this and other efficiency issues.
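One way to avoid the blow-up, when the pattern allows it, is to remove the nesting. As an illustrative sketch (the test strings and variable names are ours): since b+ inside a starred group accepts nothing that a repeated b does not, (a|b+)* accepts the same strings as [ab]*, and the flat form leaves the engine only one way to split the input.

```perl
use strict;
use warnings;

my $nested = qr/^(?:a|b+)*$/;   # nested quantifiers: many ways to split a run of b's
my $flat   = qr/^[ab]*$/;       # same language, a single way to match

# Check on a few short strings that both patterns agree.
my @inputs = ("", "ab", "bbbab", "bbbc", "abba", "c");
my @agree  = grep { ($_ =~ $nested ? 1 : 0) == ($_ =~ $flat ? 1 : 0) } @inputs;

print scalar(@agree), " of ", scalar(@inputs), " inputs give the same answer\n";
```

Whether the nested form is actually slow on a given input depends on the Perl version, since the engine applies its own optimizations; the rewrite simply removes the source of the ambiguity.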

Removing the Comments from a C Program

The following example removes the comments from a C program.

casiano@millo:~/Lperltesting$ cat -n comments.pl
    1   #!/usr/bin/perl -w
    2   use strict;
    3 
    4   my $progname = shift @ARGV or die "Usage:\n$0 prog.c\n";
    5   open(my $PROGRAM,"<$progname") || die "can't find $progname\n";
    6   my $program = '';
    7   {
    8     local $/ = undef;
    9     $program = <$PROGRAM>;
   10   }
   11   $program =~ s{
   12     /\*  # Match the opening delimiter
   13     .*?  # Match a minimal number of characters
   14     \*/  # Match the closing delimiter
   15   }[]gsx;
   16 
   17   print $program;

Let us look at an example run. Suppose the input file is:

> cat hello.c
#include <stdio.h>
/* first
comment
*/
main() {
  printf("hello world!\n"); /* second comment */
}

Running the program on that input file then produces the output:

> comments.pl hello.c
#include <stdio.h>
 
main() {
  printf("hello world!\n");
}
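The substitution in comments.pl would also delete a /* ... */ sequence that appears inside a C string literal. A common refinement, shown here only as a sketch and not as part of the original program, is to match string literals first and put them back unchanged. It assumes well-formed double-quoted strings and ignores character constants and // comments.

```perl
use strict;
use warnings;

# Hypothetical input: one pseudo-comment inside a string, one real comment.
my $program = <<'END_C';
printf("/* not a comment */\n"); /* real comment */
END_C

$program =~ s{
    ( " (?: \\. | [^"\\] )* " )   # group 1: a double-quoted string literal
  |
    /\* .*? \*/                   # a comment
}{ defined $1 ? $1 : '' }gsex;    # keep strings, drop comments

print $program;
```

The /e modifier evaluates the replacement as Perl code, which is what allows the choice between restoring the string and deleting the comment.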

Let us see the difference in behavior between * and *? in the previous example:

pl@nereida:~/src/perl/perltesting$  perl5_10_1 -wde 0
main::(-e:1):   0
  DB<1>   use re 'debug'; 'main() /* 1c */ { /* 2c */ return; /* 3c */ }' =~ qr{(/\*.*\*/)}; print "\n$1\n"
Compiling REx "(/\*.*\*/)"
Final program:
   1: OPEN1 (3)
   3:   EXACT </*> (5)
   5:   STAR (7)
   6:     REG_ANY (0)
   7:   EXACT <*/> (9)
   9: CLOSE1 (11)
  11: END (0)
anchored "/*" at 0 floating "*/" at 2..2147483647 (checking floating) minlen 4
Guessing start of match in sv for REx "(/\*.*\*/)" against "main() /* 1c */ { /* 2c */ return; /* 3c */ }"
Found floating substr "*/" at offset 13...
Found anchored substr "/*" at offset 7...
Starting position does not contradict /^/m...
Guessed: match at offset 7
Matching REx "(/\*.*\*/)" against "/* 1c */ { /* 2c */ return; /* 3c */ }"
   7      |  1:OPEN1(3)
   7      |  3:EXACT </*>(5)
   9 <() /*> < 1c */ { />    |  5:STAR(7)
                                  REG_ANY can match 36 times out of 2147483647...
  41 <; /* 3c > <*/ }>       |  7:  EXACT <*/>(9)
  43 <; /* 3c */> < }>       |  9:  CLOSE1(11)
  43 <; /* 3c */> < }>       | 11:  END(0)
Match successful!

/* 1c */ { /* 2c */ return; /* 3c */
Freeing REx: "(/\*.*\*/)"

  DB<2>   use re 'debug'; 'main() /* 1c */ { /* 2c */ return; /* 3c */ }' =~ qr{(/\*.*?\*/)}; print "\n$1\n"
Compiling REx "(/\*.*?\*/)"
Final program:
   1: OPEN1 (3)
   3:   EXACT </*> (5)
   5:   MINMOD (6)
   6:   STAR (8)
   7:     REG_ANY (0)
   8:   EXACT <*/> (10)
  10: CLOSE1 (12)
  12: END (0)
anchored "/*" at 0 floating "*/" at 2..2147483647 (checking floating) minlen 4
Guessing start of match in sv for REx "(/\*.*?\*/)" against "main() /* 1c */ { /* 2c */ return; /* 3c */ }"
Found floating substr "*/" at offset 13...
Found anchored substr "/*" at offset 7...
Starting position does not contradict /^/m...
Guessed: match at offset 7
Matching REx "(/\*.*?\*/)" against "/* 1c */ { /* 2c */ return; /* 3c */ }"
   7      |  1:OPEN1(3)
   7      |  3:EXACT </*>(5)
   9 <() /*> < 1c */ { />    |  5:MINMOD(6)
   9 <() /*> < 1c */ { />    |  6:STAR(8)
                                  REG_ANY can match 4 times out of 4...
  13 <* 1c > <*/ { /* 2c>    |  8:  EXACT <*/>(10)
  15 <1c */> < { /* 2c *>    | 10:  CLOSE1(12)
  15 <1c */> < { /* 2c *>    | 12:  END(0)
Match successful!

/* 1c */
Freeing REx: "(/\*.*?\*/)"

  DB<3>

See also the documentation in the section 'Matching repetitions' of perlretut and the section 'Quantifiers' of perlre.

Negations and lazy operators

The expressions X[^X]*X and X.*?X, where X is an arbitrary character, are often used as if they were nearly equivalent.

The first means:
A string delimited by Xs that contains no X in its interior
The second means:
A string that starts at an X and ends at the X closest to the starting X

The equivalence breaks down when these assumptions do not hold, for instance when the pattern contains further elements that force .*? to run past an intermediate X.

In the following example we try to detect double-quoted strings that end with an exclamation mark:

pl@nereida:~/Lperltesting$ cat -n negynogreedy.pl
     1  #!/usr/bin/perl -w
     2  use strict;
     3
     4  my $b = 'Ella dijo "Ana" y yo contesté: "Jamás!". Eso fué todo.';
     5  my $a;
     6  ($a = $b) =~ s/".*?!"/-$&-/;
     7  print "$a\n";
     8
     9  $b =~ s/"[^"]*!"/-$&-/;
    10  print "$b\n";

Running the program we get:

> negynogreedy.pl
Ella dijo -"Ana" y yo contesté: "Jamás!"-. Eso fué todo.
Ella dijo "Ana" y yo contesté: -"Jamás!"-. Eso fué todo.

Simultaneous copy and substitution

The binding operator =~ lets us "associate" a variable with a match or substitution operation. If the operation is a substitution and you want to keep the original string, you need to make a copy:

$d = $s;
$d =~ s/esto/por lo otro/;

Instead, you can shorten this a little with the following Perl idiom:

($d = $s) =~ s/esto/por lo otro/;

Note the parentheses: the assignment ($d = $s) is evaluated first and yields $d itself as an lvalue, to which the substitution is then applied.
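Since Perl 5.14 the same effect can also be obtained with the /r modifier, which returns the modified copy and leaves the original untouched. A minimal sketch with strings of our own, comparing both forms:

```perl
use 5.014;    # the /r modifier needs Perl 5.14 or later
use strict;
use warnings;

my $s = "keep this";

# Classic idiom: copy first, then substitute in the copy.
(my $copy = $s) =~ s/this/that/;

# With /r the substitution returns the new string instead of modifying $s.
my $returned = $s =~ s/this/that/r;

print "$s | $copy | $returned\n";
```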

References to Previous Parentheses

Relative references make regular expressions more reusable. See the documentation in the section 'Relative backreferences' of perlretut:

Counting the opening parentheses to get the correct number for a backreference is errorprone as soon as there is more than one capturing group. A more convenient technique became available with Perl 5.10: relative backreferences. To refer to the immediately preceding capture group one now may write \g{-1} , the next but last is available via \g{-2}, and so on.

Another good reason in addition to readability and maintainability for using relative backreferences is illustrated by the following example, where a simple pattern for matching peculiar strings is used:

    $a99a = '([a-z])(\d)\2\1'; # matches a11a, g22g, x33x, etc.

Now that we have this pattern stored as a handy string, we might feel tempted to use it as a part of some other pattern:

    $line = "code=e99e";
    if ($line =~ /^(\w+)=$a99a$/){ # unexpected behavior!
      print "$1 is valid\n";
    } else {
      print "bad line: '$line'\n";
    }

But this doesn't match - at least not the way one might expect. Only after inserting the interpolated $a99a and looking at the resulting full text of the regexp is it obvious that the backreferences have backfired - the subexpression (\w+) has snatched number 1 and demoted the groups in $a99a by one rank. This can be avoided by using relative backreferences:

    $a99a = '([a-z])(\d)\g{-1}\g{-2}'; # safe for being interpolated

The following program illustrates the point:

casiano@millo:~/Lperltesting$ cat -n backreference.pl
    1   use strict;
    2   use re 'debug';
    3 
    4   my $a99a = '([a-z])(\d)\2\1';
    5   my $line = "code=e99e";
    6   if ($line =~ /^(\w+)=$a99a$/){ # unexpected behavior!
    7     print "$1 is valid\n";
    8   } else {
    9     print "bad line: '$line'\n";
   10   }

Running it under the debugger:

casiano@millo:~/Lperltesting$  perl5.10.1 -wd backreference.pl
main::(backreference.pl:4):     my $a99a = '([a-z])(\d)\2\1';
  DB<1>  c 6
main::(backreference.pl:6):     if ($line =~ /^(\w+)=$a99a$/){ # unexpected behavior!
  DB<2>  x ($line =~ /^(\w+)=$a99a$/)
  empty array
  DB<4>  $a99a = '([a-z])(\d)\g{-1}\g{-2}'
  DB<5>  x ($line =~ /^(\w+)=$a99a$/)
0  'code'
1  'e'
2  9
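
The same comparison can be run outside the debugger. A minimal sketch (the @fragments list and %matched hash are illustrative names) that embeds both versions of the fragment behind an extra capturing group:

```perl
use strict;
use warnings;

my $line = "code=e99e";

my @fragments = (
    [ absolute => '([a-z])(\d)\2\1' ],
    [ relative => '([a-z])(\d)\g{-1}\g{-2}' ],
);

my %matched;
for my $f (@fragments) {
    my ($label, $frag) = @$f;
    # The leading (\w+) takes number 1, shifting the absolute references;
    # the relative ones still point at the two groups just before them.
    $matched{$label} = $line =~ m/^(\w+)=${frag}\z/ ? 1 : 0;
    print "$label: ", ($matched{$label} ? "matches" : "does not match"), "\n";
}
```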

Using Named References (Perl 5.10)

The following text is taken from the section 'Named-backreferences' in perlretut:

Perl 5.10 also introduced named capture buffers and named backreferences. To attach a name to a capturing group, you write either (?<name>...) or (?'name'...). The backreference may then be written as \g{name} .

It is permissible to attach the same name to more than one group, but then only the leftmost one of the eponymous set can be referenced. Outside of the pattern a named capture buffer is accessible through the %+ hash.

Assuming that we have to match calendar dates which may be given in one of the three formats yyyy-mm-dd, mm/dd/yyyy or dd.mm.yyyy, we can write three suitable patterns where we use 'd', 'm' and 'y' respectively as the names of the buffers capturing the pertaining components of a date. The matching operation combines the three patterns as alternatives:

    $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
    $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)';
    $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)';
    for my $d (qw( 2006-10-21 15.01.2007 10/31/2005 )) {
      if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){
        print "day=$+{d} month=$+{m} year=$+{y}\n";
      }
    }

If any of the alternatives matches, the hash %+ is bound to contain the three key-value pairs.

Indeed, when we run the program:

casiano@millo:~/Lperltesting$ cat -n namedbackreferences.pl
     1  use v5.10;
     2  use strict;
     3
     4  my $fmt1 = '(?<y>\d\d\d\d)-(?<m>\d\d)-(?<d>\d\d)';
     5  my $fmt2 = '(?<m>\d\d)/(?<d>\d\d)/(?<y>\d\d\d\d)';
     6  my $fmt3 = '(?<d>\d\d)\.(?<m>\d\d)\.(?<y>\d\d\d\d)';
     7
     8  for my $d (qw( 2006-10-21 15.01.2007 10/31/2005 )) {
     9    if ( $d =~ m{$fmt1|$fmt2|$fmt3} ){
    10      print "day=$+{d} month=$+{m} year=$+{y}\n";
    11    }
    12  }

We get the output:

casiano@millo:~/Lperltesting$ perl5.10.1 -w namedbackreferences.pl
day=21 month=10 year=2006
day=15 month=01 year=2007
day=31 month=10 year=2005

As noted above:

... It is permissible to attach the same name to more than one group, but then only the leftmost one of the eponymous set can be referenced.

Let us see an example:

pl@nereida:~/Lperltesting$ perl5.10.1 -wdE 0
main::(-e:1):   0
DB<1>  # ... only the leftmost one of the eponymous set can be referenced
DB<2> $r = qr{(?<a>[a-c])(?<a>[a-f])}
DB<3> print $+{a} if 'ad' =~ $r
a
DB<4> print $+{a} if 'cf' =~ $r
c
DB<5> print $+{a} if 'ak' =~ $r
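
When several groups share a name, the captures hidden from %+ remain reachable through the %- hash, which perlvar documents as mapping each name to an array reference with one entry per group of that name. A minimal sketch reusing the pattern from the session above (the variables $leftmost and @all are illustrative):

```perl
use strict;
use warnings;

my $r = qr{(?<a>[a-c])(?<a>[a-f])};

my ($leftmost, @all);
if ('ad' =~ $r) {
    $leftmost = $+{a};        # leftmost defined group named a
    @all      = @{ $-{a} };   # every group named a, in pattern order
    print "leftmost=$leftmost all=@all\n";
}
```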

Let us rewrite the temperature conversion example using named parentheses:

pl@nereida:~/Lperltesting$ cat -n c2f_5_10v2.pl
 1  #!/usr/local/bin/perl5_10_1 -w
 2  use strict;
 3
 4  print "Enter a temperature (i.e. 32F, 100C):\n";
 5  my $input = <STDIN>;
 6  chomp($input);
 7
 8  $input =~ m/^(?:
 9              (?<farenheit>[-+]?[0-9]+(?:\.[0-9]*)?)\s*[fF]
10              |
11              (?<celsius>[-+]?[0-9]+(?:\.[0-9]*)?)\s*[cC]
12           )$/x;
13
14  my ($celsius, $farenheit);
15  if (exists $+{celsius}) {
16    $celsius = $+{celsius};
17    $farenheit = ($celsius * 9/5)+32;
18  }
19  elsif (exists $+{farenheit}) {
20    $farenheit = $+{farenheit};
21    $celsius = ($farenheit -32)*5/9;
22  }
23  else {
24    die "Expecting a temperature, so don't understand \"$input\".\n";
25  }
26
27  printf "%.2f C = %.2f F\n", $celsius, $farenheit;

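The exists function returns true if the key exists in the hash and false otherwise. Since %+ only holds the named groups that actually captured something, exists tells us which alternative matched.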
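
One detail worth checking in a pattern of this shape is anchoring: alternation has the lowest precedence in a regexp, so a leading ^ and a trailing $ apply to every branch only when the branches are wrapped in a group. The sketch below is illustrative (the names $num, $loose, $grouped and %result are not from the original program) and compares both forms:

```perl
use strict;
use warnings;

my $num = qr/[-+]?[0-9]+(?:\.[0-9]*)?/;

# Here ^ belongs to the first branch only and $ to the second only.
my $loose = qr/^(?<farenheit>$num)\s*[fF]|(?<celsius>$num)\s*[cC]$/;

# (?: ... ) makes both anchors apply to both branches.
my $grouped = qr/^(?:(?<farenheit>$num)\s*[fF]|(?<celsius>$num)\s*[cC])$/;

my %result;
for my $input ('32F', '100 c', '32Fxyz', 'abc 20C') {
    $result{$input} = [ ($input =~ $loose ? 1 : 0), ($input =~ $grouped ? 1 : 0) ];
    printf "%-8s loose=%d grouped=%d\n", $input, @{ $result{$input} };
}
```

Inputs with trailing or leading junk are accepted by the ungrouped form and rejected by the grouped one.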

Named Groups and Factoring

Using names makes regular expressions more robust and easier to factor. Consider the following regexp, which uses positional notation:

pl@nereida:~/Lperltesting$ perl5.10.1 -wde 0
main::(-e:1):   0
  DB<1> x "abbacddc" =~ /(.)(.)\2\1/
0  'a'
1  'b'

Suppose we want to reuse the regexp with repetition:

  DB<2> x "abbacddc" =~ /((.)(.)\2\1){2}/
  empty array

What happened? The new parenthesis forces us to renumber the positional references:

  DB<3> x "abbacddc" =~ /((.)(.)\3\2){2}/
0  'cddc'
1  'c'
2  'd'
  DB<4> "abbacddc" =~ /((.)(.)\3\2){2}/; print "$&\n"
abbacddc

This does not happen if we use names. The \k<a> operator refers to the value matched by the parenthesis named a:

  DB<5> x "abbacddc" =~ /((?<a>.)(?<b>.)\k<b>\k<a>){2}/
0  'cddc'
1  'c'
2  'd'

Using named groups and \k instead of absolute numeric references makes the regexp more reusable.
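
The same point can be seen with the fragment stored in a string and then embedded in a larger pattern. A minimal sketch (the variable names are illustrative): only the named version survives being wrapped in an extra capturing group.

```perl
use strict;
use warnings;

my $numeric = '(.)(.)\2\1';
my $named   = '(?<a>.)(?<b>.)\k<b>\k<a>';
my $text    = 'abbacddc';

# Wrapping the fragment in a new capturing group shifts the numbers,
# but leaves the names untouched.
my $numeric_ok = $text =~ /^(${numeric}){2}\z/ ? 1 : 0;
my $named_ok   = $text =~ /^(${named}){2}\z/   ? 1 : 0;

print "numeric: $numeric_ok, named: $named_ok\n";
```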

Calling regular expressions via capturing parentheses

It is also possible to call the regular expression associated with a parenthesis.

This paragraph, taken from the section 'Extended-Patterns' in perlre, explains how to use it:

(?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)

PARNO is a sequence of digits (not starting with 0) whose value reflects the paren-number of the capture buffer to recurse to.

....

Capture buffers contained by the pattern will have the value as determined by the outermost recursion. ....

If PARNO is preceded by a plus or minus sign then it is assumed to be relative, with negative numbers indicating preceding capture buffers and positive ones following. Thus (?-1) refers to the most recently declared buffer, and (?+1) indicates the next buffer to be declared.

Note that the counting for relative recursion differs from that of relative backreferences, in that with recursion unclosed buffers are included.

Let us see an example:

casiano@millo:~/Lperltesting$ perl5.10.1 -wdE 0
main::(-e:1):   0
  DB<1> x "AABB" =~ /(A)(?-1)(?+1)(B)/
0  'A'
1  'B'
  # Parenthesis:       1   2   2                  1
  DB<2> x 'ababa' =~ /^((?:([ab])(?1)\g{-1}|[ab]?))$/
0  'ababa'
1  'a'
  DB<3> x 'bbabababb' =~ /^((?:([ab])(?1)\g{-1}|[ab]?))$/
0  'bbabababb'
1  'b'
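
A common use of this kind of recursion is matching nested delimiters. The following is a minimal sketch, not taken from perlre: (?1) calls the pattern of the first parenthesis from inside itself, so every opening parenthesis must find its closing partner. The name $balanced, the %ok hash and the test strings are illustrative.

```perl
use strict;
use warnings;

my $balanced = qr/
    ^
    (                 # group 1: one balanced parenthesized chunk
        \(
        (?:
            [^()]++   # a run of non-parenthesis characters, possessive
          |
            (?1)      # or a nested chunk: recurse into group 1
        )*
        \)
    )
    \z
/x;

my %ok;
for my $s ('(2*(4+5/(3-2)))', '(a(b)c', '()') {
    $ok{$s} = $s =~ $balanced ? 1 : 0;
    print "$s => ", ($ok{$s} ? "balanced" : "not balanced"), "\n";
}
```

Because (?1) counts groups absolutely, a pattern like this is best used on its own; when it has to be embedded in a larger regexp, the relative form (?-1) or a named group avoids renumbering problems.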

Reusing Regular Expressions

The following rewrite of our basic example uses the Regexp::Common module to factor the regular expression:

casiano@millo:~/src/perl/perltesting$ cat -n c2f_5_10v3.pl
 1  #!/soft/perl5lib/bin/perl5.10.1 -w
 2  use strict;
 3  use Regexp::Common;
 4
 5  print "Enter a temperature (i.e. 32F, 100C):\n";
 6  my $input = <STDIN>;
 7  chomp($input);
 8
 9  $input =~ m/^(?:
10              (?<farenheit>$RE{num}{real})\s*[fF]
11              |
12              (?<celsius>$RE{num}{real})\s*[cC]
13           )$/x;
14
15  my ($celsius, $farenheit);
16  if ('celsius' ~~ %+) {
17    $celsius = $+{celsius};
18    $farenheit = ($celsius * 9/5)+32;
19  }
20  elsif ('farenheit' ~~ %+) {
21    $farenheit = $+{farenheit};
22    $celsius = ($farenheit -32)*5/9;
23  }
24  else {
25    die "Expecting a temperature, so don't understand \"$input\".\n";
26  }
27
28  printf "%.2f C = %.2f F\n", $celsius, $farenheit;

See:

The documentation of the Regexp::Common module by Abigail
Smart Matching: Perl Training Australia: Smart Match
Rafael García Suárez: the section 'Smart-matching-in-detail' in perlsyn
Enrique Nell (Barcelona Perl Mongers): Novedades en Perl 5.10

The Regexp::Common Module

The Regexp::Common module provides a large number of regular expressions, accessible through the %RE hash. An example of use follows:

casiano@millo:~/Lperltesting$ cat -n regexpcommonsynopsis.pl
     1  use strict;
     2  use Perl6::Say;
     3  use Regexp::Common;
     4
     5  while (<>) {
     6      say q{a number}              if /$RE{num}{real}/;
     7
     8      say q{a ['"`] quoted string} if /$RE{quoted}/;
     9
    10      say q{a /.../ sequence}      if m{$RE{delimited}{'-delim'=>'/'}};
    11
    12      say q{balanced parentheses}  if /$RE{balanced}{'-parens'=>'()'}/;
    13
    14      die q{a #*@%-ing word}."\n"  if /$RE{profanity}/;
    15
    16  }
    17

A sample run follows:

casiano@millo:~/Lperltesting$ perl regexpcommonsynopsis.pl
43
a number
"2+2 es" 4
a number
a ['"`] quoted string
x/y/z
a /.../ sequence
(2*(4+5/(3-2)))
a number
balanced parentheses
fuck you!
a #*@%-ing word

The following fragment of the Regexp::Common documentation explains the simplified mode of use:

To access a particular pattern, %RE is treated as a hierarchical hash of hashes (of hashes...), with each successive key being an identifier. For example, to access the pattern that matches real numbers, you specify:

        $RE{num}{real}

and to access the pattern that matches integers:

        $RE{num}{int}

Deeper layers of the hash are used to specify flags: arguments that modify the resulting pattern in some way.

The keys used to access these layers are prefixed with a minus sign and may have a value;
if a value is given, it's done by using a multidimensional key.

For example, to access the pattern that matches base-2 real numbers with embedded commas separating groups of three digits (e.g. 10,101,110.110101101):

        $RE{num}{real}{-base => 2}{-sep => ','}{-group => 3}

Through the magic of Perl, these flag layers may be specified in any order (and even interspersed through the identifier keys!) so you could get the same pattern with:

        $RE{num}{real}{-sep => ','}{-group => 3}{-base => 2}

or:

        $RE{num}{-base => 2}{real}{-group => 3}{-sep => ','}

or even:

        $RE{-base => 2}{-group => 3}{-sep => ','}{num}{real}

etc.

Note, however, that the relative order amongst the identifier keys is significant. That is:

        $RE{list}{set}

would not be the same as:

        $RE{set}{list}

Let us see an example with the debugger:

casiano@millo:~/Lperltesting$ perl -MRegexp::Common -wde 0
main::(-e:1):   0
  DB<1> x 'numero: 10,101,110.110101101 101.1e-1 234' =~ m{($RE{num}{real}{-base => 2}{-sep => ','}{-group => 3})}g
0  '10,101,110.110101101'
1  '101.1e-1'

The regular expression for a real number is relatively complex:

casiano@millo:~/src/perl/perltesting$ perl5.10.1 -wd c2f_5_10v3.pl
main::(c2f_5_10v3.pl:5):     print "Enter a temperature (i.e. 32F, 100C):\n";
  DB<1> p $RE{num}{real}
(?:(?i)(?:[+-]?)(?:(?=[0123456789]|[.])(?:[0123456789]*)(?:(?:[.])(?:[0123456789]{0,}))?)(?:(?:[E])(?:(?:[+-]?)(?:[0123456789]+))|))
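
Because the pattern is ordinary regexp text, it can be tried without the module installed. The sketch below simply pastes the string printed above into qr// (the names $real and @nums and the sample input are illustrative) and extracts every real number from a line:

```perl
use strict;
use warnings;

# Pattern text copied from the debugger output above.
my $real = qr/(?:(?i)(?:[+-]?)(?:(?=[0123456789]|[.])(?:[0123456789]*)(?:(?:[.])(?:[0123456789]{0,}))?)(?:(?:[E])(?:(?:[+-]?)(?:[0123456789]+))|))/;

# In list context with \/g, the match returns every captured number.
my @nums = '32F -4.5c 1e3' =~ /($real)/g;

print join(' | ', @nums), "\n";
```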

With the -keep option the provided pattern uses capturing parentheses:

casiano@millo:~/Lperltesting$ perl -MRegexp::Common -wde 0
main::(-e:1):   0
DB<2> x 'one, two, three, four, five' =~ /$RE{list}{-pat => '\w+'}/
0  1
DB<3> x 'one, two, three, four, five' =~ /$RE{list}{-pat => '\w+'}{-keep}/
0  'one, two, three, four, five'
1  ', '

Smart Matching

Perl 5.10 introduces the smart matching operator. The following text is taken almost verbatim from the site of the company Perl Training Australia:

Perl 5.10 introduces a new operator, called smart-match, written ~~. As the name suggests, smart-match tries to compare its arguments in an intelligent fashion. Using smart-match effectively allows many complex operations to be reduced to very simple statements.

Unlike many of the other features introduced in Perl 5.10, there's no need to use the feature pragma to enable smart-match, as long as you're using 5.10 it's available.

In Perl 5.10.0 the smart-match operator is commutative: $x ~~ $y works the same way as $y ~~ $x. From Perl 5.10.1 onward this is no longer true, and the behaviour depends mostly on the type of the right operand, as the debugger sessions and the perlsyn table in this section show.

Smart-match in action.

As a simple introduction, we can use smart-match to do a simple string comparison between simple scalars. For example:

    use feature qw(say);

    my $x = "foo";
    my $y = "bar";
    my $z = "foo";

    say '$x and $y are identical strings' if $x ~~ $y;
    say '$x and $z are identical strings' if $x ~~ $z;    # Printed

If one of our arguments is a number, then a numeric comparison is performed:

    my $num   = 100;
    my $input = <STDIN>;

    say 'You entered 100' if $num ~~ $input;

This will print our message if our user enters 100, 100.00, +100, 1e2, or any other string that looks like the number 100.

We can also smart-match against a regexp:

    my $input  = <STDIN>;

    say 'You said the secret word!' if $input ~~ /xyzzy/;

Smart-matching with a regexp also works with saved regexps created with qr.

So we can use smart-match to act like eq, == and =~, so what? Well, it does much more than that.

We can use smart-match to search a list:

casiano@millo:~/Lperltesting$ perl5.10.1 -wdE 0
main::(-e:1):   0
  DB<1> @friends = qw(Frodo Meriadoc Pippin Samwise Gandalf)
  DB<2> print "You're a friend" if 'Pippin' ~~ @friends
You're a friend
  DB<3> print "You're a friend" if 'Mordok' ~~ @friends

It's important to note that searching an array with smart-match is extremely fast. It's faster than using grep, it's faster than using first from List::Util, and it's faster than walking through the loop with foreach, even if you do know all the clever optimisations.

This is the typical way to search for an element in an array in versions before 5.10:

casiano@millo:~$ perl -wde 0
main::(-e:1):   0
  DB<1> use List::Util qw{first}
  DB<2> @friends = qw(Frodo Meriadoc Pippin Samwise Gandalf)
  DB<3> x first { $_ eq 'Pippin'} @friends
0  'Pippin'
  DB<4> x first { $_ eq 'Mordok'} @friends
0  undef
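
On a Perl where ~~ is unavailable or discouraged (it was marked experimental in Perl 5.18), the same membership tests can be written with List::Util or with a hash. A minimal sketch: any assumes List::Util 1.33 or later, and the %is_friend lookup table is an illustrative name.

```perl
use strict;
use warnings;
use List::Util qw(first any);

my @friends = qw(Frodo Meriadoc Pippin Samwise Gandalf);

# first returns the matching element (or undef); any returns a boolean.
# Both stop scanning at the first hit.
my $found     = first { $_ eq 'Pippin' } @friends;
my $is_pippin = any   { $_ eq 'Pippin' } @friends;
my $is_mordok = any   { $_ eq 'Mordok' } @friends;

# For repeated lookups, a hash built once gives constant-time tests.
my %is_friend = map { $_ => 1 } @friends;

print "found=$found pippin=", ($is_pippin ? 1 : 0),
      " mordok=", ($is_mordok ? 1 : 0),
      " gandalf=", (exists $is_friend{Gandalf} ? 1 : 0), "\n";
```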

We can also use smart-match to compare arrays:

  DB<4> @foo = qw(x y z xyzzy ninja)
  DB<5> @bar = qw(x y z xyzzy ninja)
  DB<7> print "Identical arrays" if @foo ~~ @bar
Identical arrays
  DB<8> @bar = qw(x y z xyzzy nOnjA)
  DB<9> print "Identical arrays" if @foo ~~ @bar
  DB<10>

And even search inside an array using a string (this form relies on the commutativity of Perl 5.10.0; under 5.10.1 the test below prints nothing):

 DB<11> x @foo = qw(x y z xyzzy ninja)
0  'x'
1  'y'
2  'z'
3  'xyzzy'
4  'ninja'
  DB<12> print "Array contains a ninja " if @foo ~~ 'ninja'

or using a regexp:

  DB<13> print "Array contains magic pattern" if @foo ~~ /xyz/
Array contains magic pattern
  DB<14> print "Array contains magic pattern" if @foo ~~ /\d+/

Smart-match works with array references, too:

  DB<16> $array_ref = [ 1..10 ]
  DB<17> print "Array contains 10" if 10 ~~ $array_ref
Array contains 10
  DB<18> print "Array contains 10" if $array_ref ~~ 10
  DB<19>

For a number and an array, it returns true if the scalar appears in a nested array:

casiano@millo:~/Lperltesting$ perl5.10.1 -E 'say "ok" if 42 ~~  [23, 17, [40..50], 70];'
ok
casiano@millo:~/Lperltesting$ perl5.10.1 -E 'say "ok" if 42 ~~  [23, 17, [50..60], 70];'
casiano@millo:~/Lperltesting$

Of course, we can use smart-match with more than just arrays and scalars, it works with searching for the key in a hash, too!

  DB<19> %colour = ( sky   => 'blue', grass => 'green', apple => 'red',)
  DB<20> print "I know the colour" if 'grass' ~~ %colour
I know the colour
  DB<21> print "I know the colour" if 'cloud' ~~ %colour
  DB<22>
  DB<23> print "A key starts with 'gr'" if %colour ~~ /^gr/
A key starts with 'gr'
  DB<24> print "A key starts with 'clou'" if %colour ~~ /^clou/
  DB<25>

You can even use it to see if the two hashes have identical keys:

  DB<26> print 'Hashes have identical keys' if %taste ~~ %colour;
Hashes have identical keys

The behaviour of the smart matching operator is given by the following table, taken from the section 'Smart-matching-in-detail' in perlsyn:

The behaviour of a smart match depends on what type of thing its arguments are. The behaviour is determined by the following table: the first row that applies determines the match behaviour (which is thus mostly determined by the type of the right operand). Note that the smart match implicitly dereferences any non-blessed hash or array ref, so the "Hash" and "Array" entries apply in those cases. (For blessed references, the "Object" entries apply.)

Note that the "Matching Code" column is not always an exact rendition. For example, the smart match operator short-circuits whenever possible, but grep does not.

 $a      $b        Type of Match Implied    Matching Code
 ======  =====     =====================    =============
 Any     undef     undefined                !defined $a

 Any     Object    invokes ~~ overloading on $object, or dies

 Hash    CodeRef   sub truth for each key[1] !grep { !$b->($_) } keys %$a
 Array   CodeRef   sub truth for each elt[1] !grep { !$b->($_) } @$a
 Any     CodeRef   scalar sub truth          $b->($a)

 Hash    Hash      hash keys identical (every key is found in both hashes)
 Array   Hash      hash slice existence     grep { exists $b->{$_} } @$a
 Regex   Hash      hash key grep            grep /$a/, keys %$b
 undef   Hash      always false (undef can't be a key)
 Any     Hash      hash entry existence     exists $b->{$a}

 Hash    Array     hash slice existence     grep { exists $a->{$_} } @$b
 Array   Array     arrays are comparable[2]
 Regex   Array     array grep               grep /$a/, @$b
 undef   Array     array contains undef     grep !defined, @$b
 Any     Array     match against an array element[3]
                                            grep $a ~~ $_, @$b

 Hash    Regex     hash key grep            grep /$b/, keys %$a
 Array   Regex     array grep               grep /$b/, @$a
 Any     Regex     pattern match            $a =~ /$b/

 Object  Any       invokes ~~ overloading on $object, or falls back:
 Any     Num       numeric equality         $a == $b
 Num     numish[4] numeric equality         $a == $b
 undef   Any       undefined                !defined($b)
 Any     Any       string equality          $a eq $b
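
The "Matching Code" column can be used directly when ~~ is not an option. The sketch below writes out three rows of the table by hand (Any/Hash, Array/Regex and Hash/Hash). The contents of %taste are hypothetical, since the original session does not show them, and the result variable names are illustrative.

```perl
use strict;
use warnings;

my %colour = (sky => 'blue', grass => 'green',  apple => 'red');
my %taste  = (sky => 'none', grass => 'bitter', apple => 'sweet');  # hypothetical data
my @foo    = qw(x y z xyzzy ninja);

# Any ~~ Hash: hash entry existence.
my $knows_grass = exists $colour{grass} ? 1 : 0;

# Array ~~ Regex: array grep (the count of matching elements is the truth value).
my $has_magic = grep { /xyz/ } @foo;

# Hash ~~ Hash: same number of keys and every key of one found in the other.
my $same_keys = (keys(%taste) == keys(%colour))
             && !grep { !exists $colour{$_} } keys %taste;

print "knows_grass=$knows_grass has_magic=$has_magic same_keys=",
      ($same_keys ? 1 : 0), "\n";
```

Unlike ~~, grep does not short-circuit, so for large arrays a loop with an early exit may be preferable.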

Exercises

Exercise 3.1.4

Give the output of the following program:

pl@nereida:~/Lperltesting$ cat twonumbers.pl
$_ = "I have 2 numbers: 53147";
@pats = qw{
  (.*)(\d*)
  (.*)(\d+)
  (.*?)(\d*)
  (.*?)(\d+)
  (.*)(\d+)$
  (.*?)(\d+)$
  (.*)\b(\d+)$
  (.*\D)(\d+)$
};

print "$_\n";
for $pat (@pats) {
  printf "%-12s ", $pat;
  <>;
  if ( /$pat/ ) {
    print "<$1> <$2>\n";
  } else {
    print "FAIL\n";
  }
}

Casiano Rodríguez León
2013-03-05