I’m writing a parser combinator library in Python. One thing the library must support is the ability for a parser chain to perform some lookahead, so that analogs of the regex `(.*)foo` are 1) useful, 2) work as expected and don’t fail because `(.*)` has gobbled up all the input.
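For reference, here’s the behaviour I’m after, as Python’s own regex engine provides it:

```python
import re

# The regex engine backtracks: (.*) first grabs all of "barfoo", then gives
# characters back until the literal "foo" can match.
match = re.fullmatch(r"(.*)foo", "barfoo")
print(match.group(1))  # -> 'bar'
```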
I’ve concocted a solution for this, which seems to work based on my testing, but I want to see if it can be improved in any way. My main concerns are:
- Correctness – are weird edge cases accommodated by my algorithm?
- Performance – could it be faster?
There are also a couple of minor points I would like to know about:
- Is simply attaching an extra field to a function, as `greedy` and `reluctant` do, acceptable, or should I wrap such functions in a class instead?
- I’m writing `except ParsingEnd as end: raise end` in a couple of places as a visual cue that I remember about those exceptions. Should I drop it?
The algorithm is as follows:
- Traverse the chain of parsers, feeding output of one to another
- If a parser that should perform lookahead is encountered, start adding all parsers to a ‘retry chain’, whether they perform lookahead or not.
- If some parser fails, fall back to the last parser with lookahead and change the portion of input it’s allowed to operate on.
- Reset input restrictions for all parsers to the right of this parser, and retry parsing from this point.
- If all combinations of input restrictions were tried unsuccessfully, fail.
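To make that concrete, here’s a hedged usage sketch; `everything`, `literal` and the `State(...)` constructor are illustrative assumptions written for this post, not necessarily the library’s real API:

```python
# Hypothetical usage of the (.*)foo analog.
parser = chain([greedy(everything), literal("foo")])
final = parser(State("barfoo"))
# 1. Greedy 'everything' consumes all of "barfoo"; literal("foo") then fails.
# 2. The driver restricts 'everything' to "barfo", then "barf", and so on,
#    retrying literal("foo") each time.
# 3. With 'everything' restricted to "bar", literal("foo") succeeds, so the
#    chain succeeds, just like the regex (.*)foo.
```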
Now for the code. The entire project can be seen in this repo: link, in the frozen branch review-03022018 (core file). I’ll post the critical points below.
The `chain` function, where the lookahead driver is located. It expects an iterable of parsers as the first argument, where a parser is just a callable that takes one `State` object and returns a new one. If a parser fails, it’s expected to raise a `ParsingFailure` exception. A parser is marked as having lookahead capabilities by the `greedy` and `reluctant` wrappers shown at the end of the post.
```python
def chain(funcs, combine=True, stop_on_failure=False):
    """
    Create a parser that chains a given iterable of parsers together, using
    output of one parser as input for another.

    If 'combine' is truthy, combine 'parsed's of the parsers in the chain,
    otherwise use the last one.

    If 'stop_on_failure' is truthy, stop parsing instead of failing it when a
    parser in the chain raises a ParsingFailure exception. Note that this also
    includes parsers with lookahead, effectively disabling it.

    A note on using iterators in chains: if the supplied 'funcs' is an
    iterator, there is a possibility of 'funcs' being exhausted on the first
    attempt to parse, with subsequent attempts silently succeeding because
    that's the default behaviour for empty 'funcs'. If you want to run your
    chain several times - be it because of lookahead or for different reasons -
    make sure to wrap the iterator in a list or some other *reusable* iterable
    (or call 'reuse_iter' on it, if it comes from a function).
    """
    def res(state):
        """ A chain of parsers. """
        pieces = _CachedAppender() if combine else None
        def maybe_combine(state):
            """ Concatenate 'parsed's if 'combine' is truthy. """
            if combine:
                state.parsed = "".join(pieces)
            return state
        lookahead_chain = None
        for parser in funcs:
            if lookahead_chain is None and has_lookahead(parser):
                lookahead_chain = _CachedAppender()
            if lookahead_chain is not None or has_lookahead(parser):
                parser = _restrict(parser)
                lookahead_chain.append(parser)
            try:
                state = parser(state)
                if combine:
                    pieces.append(state.parsed)
            except ParsingEnd as end:
                raise end
            except ParsingFailure as failure:
                if stop_on_failure:
                    return maybe_combine(state)
                if lookahead_chain is None:
                    raise failure
                pos = len(lookahead_chain) - 1
                while True:
                    try_shift = _shift(lookahead_chain, pos)
                    if try_shift is None:
                        raise ParsingFailure(
                            "Failed to find a combination of inputs that allows "
                            "successful parsing")
                    _reset_chain(lookahead_chain, try_shift)
                    state, failed = _try_chain(lookahead_chain, try_shift, pieces)
                    if state is None:
                        pos = failed
                        continue
                    break
        return maybe_combine(state)
    return res
```
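I haven’t posted `has_lookahead` and `get_lookahead`; given the attribute-based marking (see `greedy` at the end of the post), they amount to something like this:

```python
# Roughly what the lookahead helpers do: lookahead is just an attribute
# attached to a parser (absent, or None, if the parser has none).
def get_lookahead(parser):
    """ Return the parser's Lookahead mode, or None. """
    return getattr(parser, "lookahead", None)

def has_lookahead(parser):
    """ Return True if 'parser' performs lookahead. """
    return get_lookahead(parser) is not None
```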
There are a couple of helper things that `chain` uses (directly or not). The most important is `_RestrictedParser`, a wrapper over a parser that remembers two things: the state it took as an argument the last time it ran, and the portion of input it is allowed to operate on:
```python
class _RestrictedParser():
    """ A parser that only operates on a restricted portion of input. """
    def __init__(self, parser):
        self.parser = parser
        self.lookahead = get_lookahead(parser)
        self.delta = 0
        self.state_before = None

    def __call__(self, state):
        self.state_before = state
        if self.lookahead is None:
            return self.parser(state)
        if self.lookahead is Lookahead.GREEDY:
            return _subparse(state, self.parser, len(state.left) - self.delta)
        # is reluctant
        return _subparse(state, self.parser, self.delta)

    def overrestricted(self):
        """
        Return True if restrictions have reached their maximum - that is, if
        the allowed input portion has either shrunk into an empty string or
        extended beyond the bounds of leftover input.
        """
        return self.delta > len(self.state_before.left)

    def reset(self):
        """ Reset restrictions. """
        self.delta = 0
        self.state_before = None

    def restrict_more(self):
        """ Increase restriction level on the input. """
        self.delta += 1
```
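To make the `delta` semantics concrete: with 5 characters of leftover input, a greedy parser’s allowed window shrinks as `delta` grows (5, 4, ..., 0 characters), while a reluctant parser walks the same sizes in the opposite order (0, 1, ..., 5); once `delta` exceeds the leftover length, `overrestricted` reports that every window size has been tried.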
`_subparse` simply splits a `State` object in two at the specified position, runs the given parser on the first part, then joins the leftovers with the second part.
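In sketch form (the `State` method used here, `with_left`, is invented for illustration; the real `_subparse` is in the repo):

```python
def _subparse(state, parser, at):
    """ Run 'parser' on only the first 'at' characters of leftover input. """
    head, tail = state.left[:at], state.left[at:]
    after = parser(state.with_left(head))       # hypothetical State.with_left
    return after.with_left(after.left + tail)   # re-attach the hidden remainder
```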
The second important thing is `_shift`, which attempts to restrict parsers starting at a given position and moving to the left. If a parser cannot be restricted any further, the next parser to the left is restricted as well (and if that one can’t be either, the one after it is tried, and so on). If all restriction configurations have been tried, the function returns `None`; otherwise it returns the index of the last restricted parser.
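In outline, that search looks like this (a sketch of the description above, not the repo’s exact code):

```python
def _shift(parsers, from_pos):
    """
    Restrict parsers starting at 'from_pos' and moving left. Return the index
    of the last parser successfully restricted, or None if every restriction
    configuration has been tried.
    """
    for pos in range(from_pos, -1, -1):
        if not has_lookahead(parsers[pos]):
            continue  # plain parsers carry no restrictions to tighten
        parsers[pos].restrict_more()
        if not parsers[pos].overrestricted():
            return pos
    return None
```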
The `_reset_chain` helper function used by `chain` simply resets restrictions on the parsers to the right of a given position.
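Again a sketch of the described behaviour rather than the repo’s exact code:

```python
def _reset_chain(parsers, from_pos):
    """ Reset restrictions on every parser to the right of 'from_pos'. """
    for pos in range(from_pos + 1, len(parsers)):
        if has_lookahead(parsers[pos]):
            parsers[pos].reset()
```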
`_try_chain` attempts to parse, starting from the state remembered by the `_RestrictedParser` at a given position, using the current configuration of restrictions in the chain:
```python
def _try_chain(parsers, from_pos, pieces):
    """
    Try to parse the state the first parser in the chain remembers.

    Return a tuple (state, index of the first parser to fail). In case of
    failure, 'state' will be None.

    Also, if 'pieces' is not None, append every parsed chunk to it, having
    first dropped every invalidated piece. If an attempt to parse fails,
    'pieces' will not be affected.
    """
    state = parsers[from_pos].state_before
    i = from_pos
    new_pieces = None if pieces is None else deque()
    for i in range(from_pos, len(parsers)):
        try:
            state = parsers[i](state)
            if pieces is not None:
                new_pieces.append(state.parsed)
        except ParsingFailure:
            return (None, i)
        except ParsingEnd as end:
            raise end
    if pieces is not None:
        # '-1' is here because the last parser does not contribute a piece, as
        # it has failed
        pieces.drop(len(parsers) - from_pos - 1)
        pieces.extend(new_pieces)
    return state, i
```
The rest of the helpers used are:
- `_CachedAppender` – just a simple wrapper over a deque that allows faster indexing, provided the deque doesn’t change very often (a rough sketch follows this list).
- `_restrict` – wraps a parser with lookahead in `_RestrictedParser` and does nothing to a parser without lookahead.
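Here is a rough sketch of `_CachedAppender`’s shape (the real one is in the repo; `drop` and `extend` are included because `_try_chain` uses them):

```python
from collections import deque

class _CachedAppender:
    """ Wrap a deque, caching a list copy so repeated indexing stays O(1). """
    def __init__(self):
        self._deque = deque()
        self._cache = []
        self._dirty = False

    def append(self, item):
        self._deque.append(item)
        self._dirty = True

    def extend(self, items):
        self._deque.extend(items)
        self._dirty = True

    def drop(self, num):
        """ Remove 'num' items from the right end. """
        for _ in range(num):
            self._deque.pop()
        self._dirty = True

    def __getitem__(self, index):
        if self._dirty:
            self._cache = list(self._deque)
            self._dirty = False
        return self._cache[index]

    def __iter__(self):
        return iter(self._deque)

    def __len__(self):
        return len(self._deque)
```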
`greedy` and `reluctant` are simply:
```python
def greedy(parser):
    """ Return a greedy version of 'parser'. """
    try:
        if parser.lookahead is Lookahead.GREEDY:
            return parser
    except AttributeError:
        pass
    res = lambda state: parser(state)
    res.lookahead = Lookahead.GREEDY
    return res
```
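with `reluctant` being the mirror image (`Lookahead.RELUCTANT` in place of `Lookahead.GREEDY`):

```python
def reluctant(parser):
    """ Return a reluctant version of 'parser'. """
    try:
        if parser.lookahead is Lookahead.RELUCTANT:
            return parser
    except AttributeError:
        pass
    res = lambda state: parser(state)
    res.lookahead = Lookahead.RELUCTANT
    return res
```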
I don’t want to pollute the post with extraneous code, so I haven’t posted everything I mention; if anything needs to be added, please tell me.