[PyTeam] pyparsing - Regex for those who don't know regex


For lazyasses like me who never bothered to learn regular expressions, Paul McGuire created the pyparsing module. I didn't know it existed until an article about it showed up in my Liferea, and I was more than impressed.

Here is the article:

If you haven't used Paul McGuire's pyparsing module yet, you've been missing out on a great tool. Whenever you hit a wall trying to parse text with regular expressions or string operations, 'think pyparsing'.

I had the need to parse a load balancer configuration file and save certain values in a database. Most of the stuff I needed was fairly easily obtainable with regular expressions or Python string operations. However, I was stumped when I encountered a line such as:

bind http "Customer Server 1" http "Customer Server 2" http

This line 'binds' a 'virtual server' port to one or more 'real servers' and their ports. (I'm using this particular load balancer's jargon here, but the concepts are the same for all load balancers.)

The syntax is 'bind' followed by a word denoting the virtual server port, followed by one or more pairs of real server names and ports. The kicker is that the real server names can be either a single word containing no whitespace, or multiple words enclosed in double quotes.

Splitting the line by spaces or double quotes is not the solution in this case. I started out by rolling my own little algorithm and keeping track of where I was inside the string, then I realized that I was actually writing my own parser at that point. Time to reach for pyparsing.
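To make the problem concrete, here is a small sketch in plain Python (no pyparsing) of what the two naive splits produce for the sample line. The variable names are only illustrative.

```python
# Sample line from the load balancer configuration discussed above.
line = 'bind http "Customer Server 1" http "Customer Server 2" http'

# Splitting on whitespace tears the quoted server names apart.
by_space = line.split()
print(by_space)

# Splitting on double quotes keeps the names whole, but leaves fragments
# such as 'bind http ' that still need a second pass, and it does nothing
# for lines that use unquoted names.
by_quote = line.split('"')
print(by_quote)
```

Neither result maps directly onto "virtual port, then (server, port) pairs", which is the structure the grammar further down expresses.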

I won't go into the details of how to use pyparsing, since there is great documentation available (see Paul's PyCon06 presentation, the examples on the pyparsing site, and also Paul's O'Reilly Shortcut book). Basically you need to define your grammar for the expression you need to parse, then translate it into pyparsing-specific constructs. Because pyparsing's API is so intuitive and powerful, the translation process is straightforward.

Here's how I ended up implementing my pyparsing grammar:

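from pyparsing import (Group, Literal, OneOrMore, Suppress, Word, alphanums,
                       alphas, dblQuotedString, printables, removeQuotes)

def parse_bind_line(line):
   # A name in double quotes; the parse action strips the quotes.
   quoted_real_server = dblQuotedString.setParseAction(removeQuotes)
   # Either a single unquoted word starting with a letter, or a quoted name.
   real_server = Word(alphas, printables) | quoted_real_server
   port = Word(alphanums)
   # Keep each (real server, port) pair together as a nested list.
   real_server_port = Group(real_server + port)
   # "bind" (dropped from the result), the virtual port, then one or more pairs.
   bind_expr = Suppress(Literal("bind")) + \
               port + \
               OneOrMore(real_server_port)
   return bind_expr.parseString(line)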

That's all there is to it. You need to read it from the bottom up to see how the expression gets decomposed into elements, and elements get decomposed into sub-elements.

I'll explain each line, starting with the last one before the return:

   bind_expr = Suppress(Literal("bind")) + \
               port + \
               OneOrMore(real_server_port)

A bind expression starts with the literal "bind", followed by a port, followed by one or more real server/port pairs. That's pretty much what the line above actually says, isn't it? The Suppress construct tells pyparsing that we're not interested in returning the literal "bind" in the final token list.
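As a minimal sketch of what Suppress changes, the snippet below parses the same text with and without it. The names with_keyword and without_keyword are made up for this illustration.

```python
from pyparsing import Literal, Suppress, Word, alphanums

port = Word(alphanums)

# Same grammar twice: once keeping the keyword, once suppressing it.
with_keyword = Literal("bind") + port
without_keyword = Suppress(Literal("bind")) + port

kept = with_keyword.parseString("bind http").asList()
dropped = without_keyword.parseString("bind http").asList()

print(kept)     # ['bind', 'http']
print(dropped)  # ['http']
```

In both cases the keyword must be present for the parse to succeed; Suppress only removes it from the returned tokens.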

   real_server_port = Group(real_server + port)

A real server/port pair is simply a real server name followed by a port. The Group construct tells pyparsing that we want to group these 2 tokens in a list inside the final token list.
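The effect of Group is easiest to see side by side. This is only a sketch: the expressions name, flat and grouped are illustrative and simpler than the real grammar.

```python
from pyparsing import Group, OneOrMore, Word, alphanums

name = Word(alphanums)
port = Word(alphanums)

# Without Group, every token lands in one flat list.
flat = OneOrMore(name + port)
# With Group, each name/port pair becomes its own nested list.
grouped = OneOrMore(Group(name + port))

flat_tokens = flat.parseString("web1 80 web2 81").asList()
grouped_tokens = grouped.parseString("web1 80 web2 81").asList()

print(flat_tokens)     # ['web1', '80', 'web2', '81']
print(grouped_tokens)  # [['web1', '80'], ['web2', '81']]
```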

   port = Word(alphanums)

A port is a word composed of alphanumeric characters. In pyparsing, a Word is a run of characters drawn from the set you give it, so it stops at whitespace or at any character outside that set. The 'alphanums' variable is a pyparsing constant: a string that already contains all the alphanumeric characters.
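A quick sketch of how Word(alphanums) behaves on a few inputs. The input "http-alt" is made up here to show where matching stops.

```python
from pyparsing import Word, alphanums

port = Word(alphanums)

named = port.parseString("http").asList()
numeric = port.parseString("8080").asList()
# Matching stops at the first character outside the allowed set ('-').
partial = port.parseString("http-alt").asList()

print(named)    # ['http']
print(numeric)  # ['8080']
print(partial)  # ['http']
```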

   real_server = Word(alphas, printables) | quoted_real_server

A real server is either a single word, or an expression in quotes. Note that we can declare a pyparsing Word with 2 arguments; the 1st argument specifies the allowed characters for the initial character of the word, whereas the 2nd argument specifies the allowed characters for the body of the word. In this case, we're saying that we want a real server name to start with an alphabetical character, but other than that it can contain any printable character.
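The two-argument form can be tried in isolation. In this sketch the input "1web" is an invented example of a name that breaks the first-character rule.

```python
from pyparsing import ParseException, Word, alphas, printables

# First character must be a letter; the rest may be any printable character.
server_name = Word(alphas, printables)

accepted = server_name.parseString("www.mywebsite.com-server1").asList()
print(accepted)  # ['www.mywebsite.com-server1']

# A name starting with a digit fails on its very first character.
try:
    server_name.parseString("1web")
    rejected = False
except ParseException:
    rejected = True
print(rejected)  # True
```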

   quoted_real_server = dblQuotedString.setParseAction(removeQuotes)

Here is where you can glimpse the power of pyparsing. With this single statement we're parsing a sequence of words enclosed in double quotes, and we're saying that we're not interested in the quotes. There's also a sglQuotedString expression for words enclosed in single quotes. Thanks to Paul for bringing this to my attention. My clumsy attempt at manually declaring a sequence of words enclosed in double quotes ran something like this:

no_quote_word = Word(alphanums+"-.")
quoted_real_server = Suppress(Literal("\"")) + \
                      OneOrMore(no_quote_word) + \
                      Suppress(Literal("\""))
quoted_real_server.setParseAction(lambda tokens: " ".join(tokens))

The only useful thing you can take away from this mumbo-jumbo is that you can associate an action with each token. When pyparsing encounters that token, it applies the action (function or class) you specified to it. This is useful for validating your tokens, for example a date. Very powerful stuff.
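As a rough sketch of the date validation idea, the parse action below converts a matched token to a date and turns an impossible date into a parse failure. The YYYY-MM-DD format, the to_date function and the iso_date expression are assumptions made for this example, not something from the article.

```python
from datetime import datetime
from pyparsing import ParseException, Word, nums

def to_date(s, loc, tokens):
    """Parse action: replace the matched text with a date, or fail the parse."""
    try:
        return datetime.strptime(tokens[0], "%Y-%m-%d").date()
    except ValueError:
        # Raising ParseException makes pyparsing treat this as a failed match.
        raise ParseException(s, loc, "not a valid date: %s" % tokens[0])

# Hypothetical token: digits and dashes, validated by the action above.
iso_date = Word(nums + "-").setParseAction(to_date)

good = iso_date.parseString("2008-03-15")[0]
print(good)

try:
    iso_date.parseString("2008-02-30")  # February 30th does not exist
    accepted = True
except ParseException:
    accepted = False
print(accepted)
```

The three-argument signature (original string, location, tokens) is what lets the action report where the bad token was found.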

Now it's time to test my function on a few strings:

if __name__ == "__main__":
   tests = """\
bind http "Customer Server 1" http "Customer Server 2" http
bind http "Customer Server - 11" 81 "Customer Server  12" 82
bind http www.mywebsite.com-server1 http www.mywebsite.com-server2 http
bind ssl www.mywebsite.com-server1 ssl www.mywebsite.com-server2 ssl
bind http TEST-server http
bind http MY-cluster-web11 83 MY-cluster-web-12 83
bind http cust1-server1.site.com http cust1-server2.site.com http""".splitlines()

   # Parse each sample line and print the resulting token list.
   for t in tests:
       print(parse_bind_line(t))

Running the code above produces this output:

$ ./parse_bind.py
['http', ['Customer Server 1', 'http'], ['Customer Server 2', 'http']]
['http', ['Customer Server - 11', '81'], ['Customer Server  12', '82']]
['http', ['www.mywebsite.com-server1', 'http'], ['www.mywebsite.com-server2', 'http']]
['ssl', ['www.mywebsite.com-server1', 'ssl'], ['www.mywebsite.com-server2', 'ssl']]
['http', ['TEST-server', 'http']]
['http', ['MY-cluster-web11', '83'], ['MY-cluster-web-12', '83']]
['http', ['cust1-server1.site.com', 'http'], ['cust1-server2.site.com', 'http']]

From here, I was able to quickly identify for a given virtual server everything I needed: a virtual server port, and all the real server/port pairs associated with it. Inserting all this into a database was just another step. The hard work had already been done by pyparsing.
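The article does not show the database step, so the following is only a guess at what it might look like, using the standard sqlite3 module. The table name bindings and its columns are hypothetical, and the parsed list is copied from the first line of the output above rather than produced by the parser.

```python
import sqlite3

# One parsed result, copied from the first line of the output above.
parsed = ['http', ['Customer Server 1', 'http'], ['Customer Server 2', 'http']]

# First token is the virtual server port; the rest are (server, port) pairs.
virtual_port, pairs = parsed[0], parsed[1:]
rows = [(virtual_port, server, port) for server, port in pairs]

# Hypothetical schema: one row per virtual port / real server binding.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bindings (virtual_port TEXT, real_server TEXT, real_port TEXT)"
)
conn.executemany("INSERT INTO bindings VALUES (?, ?, ?)", rows)
conn.commit()

stored = conn.execute(
    "SELECT real_server, real_port FROM bindings "
    "WHERE virtual_port = ? ORDER BY real_server",
    ("http",),
).fetchall()
print(stored)
```

Because Group already keeps each pair together, turning the token list into rows is a single list comprehension.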

Once more, kudos to Paul McGuire for creating such a useful and fun tool.

Source: http://agiletesting.blogspot.com/2008/03/easy-parsing-with-pyparsing.html

O'Reilly even has a book about this module (written by the module's author): Getting Started with Pyparsing

Welcome to the Pyparsing Wiki Home! - Download now from SourceForge!

The pyparsing module is an alternative approach to creating and executing simple grammars, vs. the traditional lex/yacc approach, or the use of regular expressions. The pyparsing module provides a library of classes that client code uses to construct the grammar directly in Python code.

Source: project page



Folks.. digging up an old topic  :P

For anyone who has or had trouble understanding regex (what it is for and/or how it works), I strongly recommend this article:

Click the image to access the full article!


The author explains what regex (regular expressions) is in an extremely SIMPLE and UNDERSTANDABLE way... I doubt anyone could fail to learn it after reading!

Recommended  :(



That makes two of us :( There is little or nothing understandable about it. I followed the regex tutorial here: http://www.amk.ca/python/howto/regex/

And it worked out very well for me :thumbsup:

As for pyparsing, it looks complicated to me. OK, maybe not as complicated as learning regex properly, but complicated enough that I don't feel like looking into it :x



¬¬'  :shocking:

Are you actually opening the full article? Only there will you understand the explanation; this drawing is merely illustrative.

Seriously.. I've never seen it explained better!  :confused:

