Mas sobre listas

Sig: La directiva require Sup: Análisis Sintáctico con Regexp::Grammars Ant: Llamadas privadas a subreglas Err: Si hallas una errata ...

Subsecciones

Mas sobre listas

Reconocimiento manual de listas

Analizando listas manualmente

El siguiente ejemplo muestra como construir un reconocedor de listas (posiblemente vacías) de números:

casiano@millo:~/Lregexp-grammar-examples$ cat -n simple_list.pl
     1  #!/soft/perl5lib/bin/perl5.10.1
     2  use v5.10;
     3
     4  use Regexp::Grammars;
     5
     6  my $list = qr{
     7      <List>
     8
     9      <rule: List>
    10           <digit> <List>
    11         | # empty
    12
    13      <rule: digit>
    14          <MATCH=(\d+)>
    15
    16  }xms;
    17
    18  while (my $input = <>) {
    19      chomp $input;
    20      if ($input =~ $list) {
    21          use Data::Dumper 'Dumper';
    22          warn Dumper \%/;
    23      }
    24      else {
    25        warn "Does not match\n"
    26      }
    27  }

Sigue una ejecución:

casiano@millo:~/Lregexp-grammar-examples$ ./simple_list.pl
2 3 4
$VAR1 = {
          '' => '2 3 4',
          'List' => {
                      '' => '2 3 4',
                      'digit' => '2'
                      'List' => {
                                  '' => '3 4',
                                  'digit' => '3'
                                  'List' => {
                                              '' => '4',
                                              'digit' => '4'
                                              'List' => '',
                                            },
                                },
                    }
        };

Influencia del orden en el lenguaje reconocido

Tenga en cuenta que el orden de las reglas influye en el lenguaje reconocido. Véase lo que ocurre si cambiamos en el ejemplo anterior el orden de las reglas:

casiano@millo:~/Lregexp-grammar-examples$ cat -n simple_list_empty_first.pl
     1  #!/soft/perl5lib/bin/perl5.10.1
     2  use v5.10;
     3
     4  use Regexp::Grammars;
     5
     6  my $list = qr{
     7      <List>
     8
     9      <rule: List>
    10           # empty
    11         | <digit> <List>
    12
    13      <rule: digit>
    14          <MATCH=(\d+)>
    15
    16  }xms;
    17
    18  while (my $input = <>) {
    19      chomp $input;
    20      if ($input =~ $list) {
    21          use Data::Dumper 'Dumper';
    22          warn Dumper \%/;
    23      }
    24      else {
    25        warn "Does not match\n"
    26      }
    27  }

Al ejecutar se obtiene:

casiano@millo:~/Lregexp-grammar-examples$ ./simple_list_empty_first.pl
2 3 4
$VAR1 = {
          '' => '',
          'List' => ''
        };

Por supuesto basta poner anclas en el patrón a buscar para forzar a que se reconozca la lista completa:

pl@nereida:~/Lregexpgrammars/demo$ diff simple_list_empty_first.pl simple_list_empty_first_with_anchors.pl
7c7
<     <List>
---
>     ^<List>$

En efecto, la nueva versión reconoce la lista:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 simple_list_empty_first_with_anchors.pl
2 3 4
$VAR1 = {
          '' => '2 3 4',
          'List' => {
                      'List' => {
                                  'List' => {
                                              'List' => '',
                                              '' => '4',
                                              'digit' => '4'
                                            },
                                  '' => '3 4',
                                  'digit' => '3'
                                },
                      '' => '2 3 4',
                      'digit' => '2'
                    }
        };

Si se quiere mantener la producción vacía en primer lugar pero forzar el reconocimiento de la lista completa, se puede hacer uso de un lookahead negativo:

pl@nereida:~/Lregexpgrammars/demo$ cat -n simple_list_empty_first_with_lookahead.pl
     1  #!/soft/perl5lib/bin/perl5.10.1
     2  use v5.10;
     3
     4  use strict;
     5  use Regexp::Grammars;
     6
     7  my $list = qr{
     8      <List>
     9
    10      <rule: List>
    11           (?! <digit> ) # still empty production
    12         | <digit> <List>
    13
    14      <rule: digit>
    15          <MATCH=(\d+)>
    16
    17  }xms;
    18
    19  while (my $input = <>) {
    20      chomp $input;
    21      if ($input =~ $list) {
    22          use Data::Dumper 'Dumper';
    23          warn Dumper \%/;
    24      }
    25      else {
    26        warn "Does not match\n"
    27      }
    28  }

Así, sólo se reducirá por la regla vacía si el siguiente token no es un número. Sigue un ejemplo de ejecución:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 simple_list_empty_first_with_lookahead.pl
2 3 4
$VAR1 = {
          '' => '2 3 4',
          'List' => {
                      'List' => {
                                  'List' => {
                                              'List' => '',
                                              '' => '4',
                                              'digit' => '4'
                                            },
                                  '' => '3 4',
                                  'digit' => '3'
                                },
                      '' => '2 3 4',
                      'digit' => '2'
                    }
        };

Aplanamiento manual de listas

¿Cómo podemos hacer que la estructura retornada por el reconocedor sea una lista?. Podemos añadir acciones como sigue:

casiano@millo:~/Lregexp-grammar-examples$ cat -n simple_list_action.pl
     1  #!/soft/perl5lib/bin/perl5.10.1
     2  use v5.10;
     3
     4  use Regexp::Grammars;
     5
     6  my $list = qr{
     7      <List>
     8
     9      <rule: List>
    10           <digit> <X=List> <MATCH= (?{ unshift @{$MATCH{X}}, $MATCH{digit}; $MATCH{X} })>
    11         | # empty
    12           <MATCH= (?{ [] })>
    13
    14      <rule: digit>
    15          <MATCH=(\d+)>
    16
    17  }xms;
    18
    19  while (my $input = <>) {
    20      chomp $input;
    21      if ($input =~ $list) {
    22          use Data::Dumper 'Dumper';
    23          warn Dumper \%/;
    24      }
    25      else {
    26        warn "Does not match\n"
    27      }
    28  }

Al ejecutarse este programa produce una salida como:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 simple_list_action.pl
2 3 4
$VAR1 = {
          '' => '2 3 4',
          'List' => [ '2', '3', '4' ]
        };

Los operadores de repetición

Los operadores de repetición como *, +, etc. permiten simplificar el análisis de lenguajes de listas:

pl@nereida:~/Lregexpgrammars/demo$ cat -n simple_list_star.pl
 1  #!/soft/perl5lib/bin/perl5.10.1
 2  use v5.10;
 3
 4  use Regexp::Grammars;
 5
 6  my $list = qr{
 7      <List>
 8
 9      <rule: List>
10          (?: <[digit]>)*
11
12      <rule: digit>
13          <MATCH=(\d+)>
14
15  }xms;
16
17  while (my $input = <>) {
18      chomp $input;
19      if ($input =~ $list) {
20          use Data::Dumper 'Dumper';
21          warn Dumper \%/;
22      }
23      else {
24        warn "Does not match\n"
25      }
26  }

Los corchetes alrededor de digit hacen que el valor asociado con el patrón sea la lista de números. Si no los ponemos el valor asociado sería el último valor de la lista.

Listas separadas por Algo

One of the commonest tasks in text parsing is to match a list of unspecified length, in which items are separated by a fixed token. Things like:

    1, 2, 3 , 4 ,13, 91        # Numbers separated by commas and spaces

    g-c-a-g-t-t-a-c-a          # Bases separated by dashes

    /usr/local/bin             # Names separated by directory markers

    /usr:/usr/local:bin        # Directories separated by colons

The usual construct required to parse these kinds of structures is either:

    <rule: list>

        <item> <separator> <list               # recursive definition
      | <item>                                 # base case

Or, more efficiently, but less prettily:

    <rule: list>

        <[item]> (?: <separator> <[item]> )*   # iterative definition

Because this is such a common requirement, Regexp::Grammars provides a cleaner way to specify the iterative version. The syntax is taken from Perl 6:

    <rule: list>

        <[item]> ** <separator>                # iterative definition

This is a repetition specifier on the first subrule (hence the use of ** as the marker, to reflect the repetitive behaviour of *). However, the number of repetitions is controlled by the second subrule: the first subrule will be repeatedly matched for as long as the second subrule matches immediately after it.

So, for example, you can match a sequence of numbers separated by commas with:

    <[number]> ** <comma>

    <token: number>  \d+
    <token: comma>   \s* , \s*

Note that it's important to use the <[...]> form for the items being matched, so that all of them are saved in the result hash. You can also save all the separators (if that's important):

    <[number]> ** <[comma]>

The repeated item must be specified as a subrule call fo some kind, but the separators may be specified either as a subrule or a bracketed pattern. For example:

    <[number]> ** ( , )

The separator must always be specified in matched delimiters of some kind: either matching <...> or matching (...). A common error is to write:

    <[number]> ** ,

You can also use a pattern as the item matcher, but it must be aliased into a subrule:

    <[item=(\d+)]> ** ( , )

Ejemplo: Listas de números separados por comas

Veamos un ejemplo sencillo:

casiano@millo:~/src/perl/regexp-grammar-examples$ cat -n demo_list.pl
 1  #!/soft/perl5lib/bin/perl5.10.1
 2  use v5.10;
 3
 4  use Regexp::Grammars;
 5
 6  my $list_nonempty = qr{
 7      <List>
 8
 9      <rule: List>
10          \(  <[Value]> ** (,)  \)
11
12      <token: Value>
13          \d+
14  }xms;
15
16  my $list_empty = qr{
17      <List>
18
19      <rule: List>
20          \(  (?: <[Value]> ** <_Sep=(,)> )?  \)
21
22      <token: Value>
23          \d+
24  }xms;
25
26  use Smart::Comments;
27
28
29  while (my $input = <>) {
30      my $input2 = $input;
31      if ($input =~ $list_nonempty) {
32          ### nonempty: $/{List}
33      }
34      if ($input2 =~ $list_empty) {
35          ### empty: $/{List}
36      }
37  }

Sigue un ejemplo de ejecución:

casiano@millo:~/src/perl/regexp-grammar-examples$ ./demo_list.pl
(3,4,5)

### nonempty: {
###             '' => '(3,4,5)',
###             Value => [
###                        '3',
###                        '4',
###                        '5'
###                      ]
###           }

### empty: {
###          '' => '(3,4,5)',
###          Value => [
###                     '3',
###                     '4',
###                     '5'
###                   ]
###        }
()

### empty: '()'

Ejemplo: AST para las expresiones aritméticas

Las expresiones aritméticas puede definirse como una jerarquía de listas como sigue:

pl@nereida:~/Lregexpgrammars/demo$ cat -n calcaslist.pl
 1  use strict;
 2  use warnings;
 3  use 5.010;
 4  use Data::Dumper;
 5  $Data::Dumper::Indent = 1;
 6
 7  my $rbb = do {
 8      use Regexp::Grammars;
 9
10      qr{
11        \A<expr>\z
12
13        <objrule: expr>      <[operands=term]> ** <[operators=addop]>
14
15        <objrule: term>      <[operands=uneg]> ** <[operators=mulop]>
16
17        <objrule: uneg>      <[operators=minus]>* <[operands=power]>
18
19        <objrule: power>     <[operands=factorial]> ** <[operators=powerop]>
20
21        <objrule: factorial> <[operands=factor]>  <[operators=(!)]>*
22
23        <objrule: factor>    <val=([+-]?\d+(?:\.\d*)?)>
24                           | \( <MATCH=expr> \)
25
26        <token: addop>        [+-]
27
28        <token: mulop>        [*/]
29
30        <token: powerop>      \*\*|\^
31
32        <token: minus>        - <MATCH=(?{ 'NEG' })>
33
34      }x;
35  };
36
37  while (my $input = <>) {
38      chomp($input);
39      if ($input =~ m{$rbb}) {
40          my $tree = $/{expr};
41          say Dumper $tree;
42
43      }
44      else {
45          say("does not match");
46      }
47  }

Obsérvese el árbol generado para la expresión 4-2-2:

pl@nereida:~/Lregexpgrammars/demo$ perl5.10.1 calcaslist.pl
4-2-2
$VAR1 = bless( {
  'operands' => [
    bless( {
      'operands' => [
        bless( {
          'operands' => [
            bless( {
              'operands' => [
                bless( {
                  'operands' => [
                    bless( { '' => '4', 'val' => '4' }, 'factor' )
                  ],
                  '' => '4'
                }, 'factorial' )
              ],
              '' => '4'
            }, 'power' )
          ],
          '' => '4'
        }, 'uneg' )
      ],
      '' => '4'
    }, 'term' ),
    bless( {
      'operands' => [
        bless( {
          'operands' => [
            bless( {
              'operands' => [
                bless( {
                  'operands' => [
                    bless( { '' => '2', 'val' => '2' }, 'factor' )
                  ],
                  '' => '2'
                }, 'factorial' )
              ],
              '' => '2'
            }, 'power' )
          ],
          '' => '2'
        }, 'uneg' )
      ],
      '' => '2'
    }, 'term' ),
    bless( {
      'operands' => [
        bless( {
          'operands' => [
            bless( {
              'operands' => [
                bless( {
                  'operands' => [
                    bless( { '' => '2', 'val' => '2' }, 'factor' )
                  ],
                  '' => '2'
                }, 'factorial' )
              ],
              '' => '2'
            }, 'power' )
          ],
          '' => '2'
        }, 'uneg' )
      ],
      '' => '2'
    }, 'term' )
  ],
  '' => '4-2-2',
  'operators' => [
    '-',
    '-'
  ]
}, 'expr' );

Sig: La directiva require Sup: Análisis Sintáctico con Regexp::Grammars Ant: Llamadas privadas a subreglas Err: Si hallas una errata ...

Casiano Rodríguez León
2013-03-05