1   What’s New in Pyparsing 3.0.0

author:Paul McGuire
date:May, 2022
abstract:This document summarizes the changes made in the 3.0.0 release of pyparsing. (Updated to reflect changes up to 3.0.10)

1.1   New Features

1.1.1   PEP-8 naming

This release of pyparsing will (finally!) include PEP-8 compatible names and arguments. Backward-compatibility is maintained by defining synonyms using the old camelCase names pointing to the new snake_case names.

This code written using non-PEP8 names:

wd = pp.Word(pp.printables, excludeChars="$")
wd_list = pp.delimitedList(wd, delim="$")
print(wd_list.parseString("dkls$134lkjk$lsd$$").asList())

can now be written as:

wd = pp.Word(pp.printables, exclude_chars="$")
wd_list = pp.delimited_list(wd, delim="$")
print(wd_list.parse_string("dkls$134lkjk$lsd$$").as_list())

Pyparsing 3.0 will run both versions of this example.

New code should be written using the PEP-8 compatible names. The compatibility synonyms will be removed in a future version of pyparsing.

1.1.2   Railroad diagramming

An excellent new enhancement is the new railroad diagram generator for documenting pyparsing parsers.:

import pyparsing as pp

# define a simple grammar for parsing street addresses such
# as "123 Main Street"
#     number word...
number = pp.Word(pp.nums).set_name("number")
name = pp.Word(pp.alphas).set_name("word")[1, ...]

parser = number("house_number") + name("street")
parser.set_name("street address")

# construct railroad track diagram for this parser and
# save as HTML
parser.create_diagram('parser_rr_diag.html')

create_diagram accepts these named arguments:

  • vertical (int) - threshold for formatting multiple alternatives vertically instead of horizontally (default=3)
  • show_results_names - bool flag whether diagram should show annotations for defined results names
  • show_groups - bool flag whether groups should be highlighted with an unlabeled surrounding box
  • embed - bool flag whether generated HTML should omit <HEAD>, <BODY>, and <DOCTYPE> tags to embed the resulting HTML in an enclosing HTML source (new in 3.0.10)
  • head - str containing additional HTML to insert into the <HEAD> section of the generated code; can be used to insert custom CSS styling
  • body - str containing additional HTML to insert at the beginning of the <BODY> section of the generated code

To use this new feature, install the supporting diagramming packages using:

pip install pyparsing[diagrams]

See more in the examples directory: make_diagram.py and railroad_diagram_demo.py.

(Railroad diagram enhancement contributed by Michael Milton)

1.1.3   Support for left-recursive parsers

Another significant enhancement in 3.0 is support for left-recursive (LR) parsers. Previously, given a left-recursive parser, pyparsing would recurse repeatedly until hitting the Python recursion limit. Following the methods of the Python PEG parser, pyparsing uses a variation of packrat parsing to detect and handle left-recursion during parsing.:

import pyparsing as pp
pp.ParserElement.enable_left_recursion()

# a common left-recursion definition
# define a list of items as 'list + item | item'
# BNF:
#   item_list := item_list item | item
#   item := word of alphas
item_list = pp.Forward()
item = pp.Word(pp.alphas)
item_list <<= item_list + item | item

item_list.run_tests("""\
    To parse or not to parse that is the question
    """)

Prints:

['To', 'parse', 'or', 'not', 'to', 'parse', 'that', 'is', 'the', 'question']

See more examples in left_recursion.py in the pyparsing examples directory.

(LR parsing support contributed by Max Fischer)

1.1.4   Packrat/memoization enable and disable methods

As part of the implementation of left-recursion support, new methods have been added to enable and disable packrat parsing.

Name Description
enable_packrat Enable packrat parsing (with specified cache size)
enable_left_recursion Enable left-recursion cache
disable_memoization Disable all internal parsing caches

1.1.5   Type annotations on all public methods

Python 3.6 and upward compatible type annotations have been added to most of the public methods in pyparsing. This should facilitate developing pyparsing-based applications using IDEs for development-time type checking.

1.1.6   New string constants identchars and identbodychars to help in defining identifier Word expressions

Two new module-level strings have been added to help when defining identifiers, identchars and identbodychars.

Instead of writing:

import pyparsing as pp
identifier = pp.Word(pp.alphas + "_", pp.alphanums + "_")

you will be able to write:

identifier = pp.Word(pp.identchars, pp.identbodychars)

Those constants have also been added to all the Unicode string classes:

import pyparsing as pp
ppu = pp.pyparsing_unicode

cjk_identifier = pp.Word(ppu.CJK.identchars, ppu.CJK.identbodychars)
greek_identifier = pp.Word(ppu.Greek.identchars, ppu.Greek.identbodychars)

1.1.7   Refactored/added diagnostic flags

Expanded __diag__ and __compat__ to actual classes instead of just namespaces, to add some helpful behavior:

  • pyparsing.enable_diag() and pyparsing.disable_diag() methods to give extra help when setting or clearing flags (detects invalid flag names, detects when trying to set a __compat__ flag that is no longer settable). Use these methods now to set or clear flags, instead of directly setting to True or False:

    import pyparsing as pp
    pp.enable_diag(pp.Diagnostics.warn_multiple_tokens_in_named_alternation)
    
  • pyparsing.enable_all_warnings() is another helper that sets all “warn*” diagnostics to True:

    pp.enable_all_warnings()
    
  • added support for calling enable_all_warnings() if warnings are enabled using the Python -W switch, or setting a non-empty value to the environment variable PYPARSINGENABLEALLWARNINGS. (If using -Wd for testing, but wishing to disable pyparsing warnings, add -Wi:::pyparsing.)

  • added new warning, warn_on_match_first_with_lshift_operator to warn when using '<<' with a '|' MatchFirst operator, which will create an unintended expression due to precedence of operations.

    Example: This statement will erroneously define the fwd expression as just expr_a, even though expr_a | expr_b was intended, since '<<' operator has precedence over '|':

    fwd << expr_a | expr_b
    

    To correct this, use the '<<=' operator (preferred) or parentheses to override operator precedence:

    fwd <<= expr_a | expr_b
    

    or:

    fwd << (expr_a | expr_b)
    
  • warn_on_parse_using_empty_Forward - warns that a Forward has been included in a grammar, but no expression was attached to it using '<<=' or '<<'

  • warn_on_assignment_to_Forward - warns that a Forward has been created, but was probably later overwritten by erroneously using '=' instead of '<<=' (this is a common mistake when using Forwards) (currently not working on PyPy)

1.1.8   Support for yielding native Python list and dict types in place of ParseResults

To support parsers that are intended to generate native Python collection types such as lists and dicts, the Group and Dict classes now accept an additional boolean keyword argument aslist and asdict respectively. See the jsonParser.py example in the pyparsing/examples source directory for how to return types as ParseResults and as Python collection types, and the distinctions in working with the different types.

In addition parse actions that must return a value of list type (which would normally be converted internally to a ParseResults) can override this default behavior by returning their list wrapped in the new ParseResults.List class:

# this parse action tries to return a list, but pyparsing
# will convert to a ParseResults
def return_as_list_but_still_get_parse_results(tokens):
    return tokens.asList()

# this parse action returns the tokens as a list, and pyparsing will
# maintain its list type in the final parsing results
def return_as_list(tokens):
    return ParseResults.List(tokens.asList())

This is the mechanism used internally by the Group class when defined using aslist=True.

1.1.9   New Located class to replace locatedExpr helper method

The new Located class will replace the current locatedExpr method for marking parsed results with the start and end locations of the parsed data in the input string. locatedExpr had several bugs, and returned its results in a hard-to-use format (location data and results names were mixed in with the located expression’s parsed results, and wrapped in an unnecessary extra nesting level).

For this code:

wd = Word(alphas)
for match in locatedExpr(wd).search_string("ljsdf123lksdjjf123lkkjj1222"):
    print(match)

the docs for locatedExpr show this output:

[[0, 'ljsdf', 5]]
[[8, 'lksdjjf', 15]]
[[18, 'lkkjj', 23]]

The parsed values and the start and end locations are merged into a single nested ParseResults (and any results names in the parsed values are also merged in with the start and end location names).

Using Located, the output is:

[0, ['ljsdf'], 5]
[8, ['lksdjjf'], 15]
[18, ['lkkjj'], 23]

With Located, the parsed expression values and results names are kept separate in the second parsed value, and there is no extra grouping level on the whole result.

The existing locatedExpr is retained for backward-compatibility, but will be deprecated in a future release.

1.1.10   New AtLineStart and AtStringStart classes

As part of fixing some matching behavior in LineStart and StringStart, two new classes have been added: AtLineStart and AtStringStart.

LineStart and StringStart can be treated as separate elements, including whitespace skipping. AtLineStart and AtStringStart enforce that an expression starts exactly at column 1, with no leading whitespace.:

(LineStart() + Word(alphas)).parseString("ABC")    # passes
(LineStart() + Word(alphas)).parseString("  ABC")  # passes
AtLineStart(Word(alphas)).parseString("  ABC")     # fails

[This is a fix to behavior that was added in 3.0.0, but was actually a regression from 2.4.x.]

1.1.11   New IndentedBlock class to replace indentedBlock helper method

The new IndentedBlock class will replace the current indentedBlock method for defining indented blocks of text, similar to Python source code. Using IndentedBlock, the expression instance itself keeps track of the indent stack, so a separate external indentStack variable is no longer required.

Here is a simple example of an expression containing an alphabetic key, followed by an indented list of integers:

integer = pp.Word(pp.nums)
group = pp.Group(pp.Char(pp.alphas) + pp.IndentedBlock(integer))

parses:

A
    100
    101
B
    200
    201

as:

[['A', [100, 101]], ['B', [200, 201]]]

By default, the results returned from the IndentedBlock are grouped.

IndentedBlock may also be used to define a recursive indented block (containing nested indented blocks).

The existing indentedBlock is retained for backward-compatibility, but will be deprecated in a future release.

1.1.12   Shortened tracebacks

Cleaned up default tracebacks when getting a ParseException when calling parse_string. Exception traces should now stop at the call in parse_string, and not include the internal pyparsing traceback frames. (If the full traceback is desired, then set ParserElement.verbose_traceback to True.)

1.1.13   Improved debug logging

Debug logging has been improved by:

  • Including try/match/fail logging when getting results from the packrat cache (previously cache hits did not show debug logging). Values returned from the packrat cache are marked with an ‘*’.

  • Improved fail logging, showing the failed expression, text line, and marker where the failure occurred.

  • Adding with_line_numbers to pyparsing_testing. Use with_line_numbers to visualize the data being parsed, with line and column numbers corresponding to the values output when enabling set_debug() on an expression:

    data = """\
       A
          100"""
    expr = pp.Word(pp.alphanums).set_name("word").set_debug()
    print(ppt.with_line_numbers(data))
    expr[...].parseString(data)
    

    prints:

    .          1
      1234567890
    1:   A
    2:      100
    Match word at loc 3(1,4)
        A
        ^
    Matched word -> ['A']
    Match word at loc 11(2,7)
           100
           ^
    Matched word -> ['100']
    

1.1.14   New / improved examples

  • number_words.py includes a parser/evaluator to parse "forty-two" and return 42. Also includes example code to generate a railroad diagram for this parser.
  • BigQueryViewParser.py added to examples directory, submitted by Michael Smedberg.
  • booleansearchparser.py added to examples directory, submitted by xecgr. Builds on searchparser.py, adding support for ‘*’ wildcards and non-Western alphabets.
  • Improvements in select_parser.py, to include new SQL syntax from SQLite, submitted by Robert Coup.
  • Off-by-one bug found in the roman_numerals.py example, a bug that has been there for about 14 years! Submitted by Jay Pedersen.
  • A simplified Lua parser has been added to the examples (lua_parser.py).
  • Demonstration of defining a custom Unicode set for cuneiform symbols, as well as simple Cuneiform->Python conversion is included in cuneiform_python.py.
  • Fixed bug in delta_time.py example, when using a quantity of seconds/minutes/hours/days > 999.

1.1.15   Other new features

  • url expression added to pyparsing_common, with named fields for common fields in URLs. See the updated urlExtractorNew.py file in the examples directory. Submitted by Wolfgang Fahl.

  • DelimitedList now supports an additional flag allow_trailing_delim, to optionally parse an additional delimiter at the end of the list. Submitted by Kazantcev Andrey.

  • Added global method autoname_elements() to call set_name() on all locally defined ParserElements that haven’t been explicitly named using set_name(), using their local variable name. Useful for setting names on multiple elements when creating a railroad diagram:

    a = pp.Literal("a")
    b = pp.Literal("b").set_name("bbb")
    pp.autoname_elements()
    

    a will get named “a”, while b will keep its name “bbb”.

  • Enhanced default strings created for Word expressions, now showing string ranges if possible. Word(alphas) would formerly print as W:(ABCD...), now prints as W:(A-Za-z).

  • Better exception messages to show full word where an exception occurred.:

    Word(alphas)[...].parse_string("abc 123", parse_all=True)
    

    Was:

    pyparsing.ParseException: Expected end of text, found '1'  (at char 4), (line:1, col:5)
    

    Now:

    pyparsing.exceptions.ParseException: Expected end of text, found '123'  (at char 4), (line:1, col:5)
    
  • Using ... for SkipTo can now be wrapped in Suppress to suppress the skipped text from the returned parse results.:

    source = "lead in START relevant text END trailing text"
    start_marker = Keyword("START")
    end_marker = Keyword("END")
    find_body = Suppress(...) + start_marker + ... + end_marker
    print(find_body.parse_string(source).dump())
    

    Prints:

    ['START', 'relevant text ', 'END']
    - _skipped: ['relevant text ']
    
  • Added ignore_whitespace(recurse:bool = True) and added a recurse argument to leave_whitespace, both added to provide finer control over pyparsing’s whitespace skipping. Contributed by Michael Milton.

  • Added ParserElement.recurse() method to make it simpler for grammar utilities to navigate through the tree of expressions in a pyparsing grammar.

  • The repr() string for ParseResults is now of the form:

    ParseResults([tokens], {named_results})
    

    The previous form omitted the leading ParseResults class name, and was easily misinterpreted as a tuple containing a list and a dict.

  • Minor reformatting of output from run_tests to make embedded comments more visible.

  • New pyparsing_test namespace, assert methods and classes added to support writing unit tests.

    • assertParseResultsEquals
    • assertParseAndCheckList
    • assertParseAndCheckDict
    • assertRunTestResults
    • assertRaisesParseException
    • reset_pyparsing_context context manager, to restore pyparsing config settings
  • Enhanced error messages and error locations when parsing fails on the Keyword or CaselessKeyword classes due to the presence of a preceding or trailing keyword character.

  • Enhanced the Regex class to be compatible with re’s compiled with the re-equivalent regex module. Individual expressions can be built with regex compiled expressions using:

    import pyparsing as pp
    import regex
    
    # would use regex for this expression
    integer_parser = pp.Regex(regex.compile(r'\d+'))
    
  • Fixed handling of ParseSyntaxExceptions raised as part of Each expressions, when sub-expressions contain '-' backtrack suppression.

  • Potential performance enhancement when parsing Word expressions built from pyparsing_unicode character sets. Word now internally converts ranges of consecutive characters to regex character ranges (converting "0123456789" to "0-9" for instance).

  • Added a caseless parameter to the CloseMatch class to allow for casing to be ignored when checking for close matches. Contributed by Adrian Edwards.

1.2   API Changes

  • [Note added in pyparsing 3.0.7, reflecting a change in 3.0.0] Fixed a bug in the ParseResults class implementation of __bool__, which would formerly return False if the ParseResults item list was empty, even if it contained named results. Now ParseResults will return True if either the item list is not empty or if the named results list is not empty:

    # generate an empty ParseResults by parsing a blank string with a ZeroOrMore
    result = Word(alphas)[...].parse_string("")
    print(result.as_list())
    print(result.as_dict())
    print(bool(result))
    
    # add a results name to the result
    result["name"] = "empty result"
    print(result.as_list())
    print(result.as_dict())
    print(bool(result))
    

    Prints:

    []
    {}
    False
    
    []
    {'name': 'empty result'}
    True
    

    In previous versions, the second call to bool() would return False.

  • [Note added in pyparsing 3.0.4, reflecting a change in 3.0.0] The ParseResults class now uses __slots__ to pre-define instance attributes. This means that code written like this (which was allowed in pyparsing 2.4.7):

    result = Word(alphas).parseString("abc")
    result.xyz = 100
    

    now raises this Python exception:

    AttributeError: 'ParseResults' object has no attribute 'xyz'
    

    To add new attribute values to ParseResults object in 3.0.0 and later, you must assign them using indexed notation:

    result["xyz"] = 100
    

    You will still be able to access this new value as an attribute or as an indexed item.

  • enable_diag() and disable_diag() methods to enable specific diagnostic values (instead of setting them to True or False). enable_all_warnings() has also been added.

  • counted_array formerly returned its list of items nested within another list, so that accessing the items required indexing the 0’th element to get the actual list. This extra nesting has been removed. In addition, if there are other metadata fields parsed between the count and the list items, they can be preserved in the resulting list if given results names.

  • ParseException.explain() is now an instance method of ParseException:

    expr = pp.Word(pp.nums) * 3
    try:
        expr.parse_string("123 456 A789")
    except pp.ParseException as pe:
        print(pe.explain(depth=0))
    

    prints:

    123 456 A789
            ^
    ParseException: Expected W:(0-9), found 'A789'  (at char 8), (line:1, col:9)
    

    To run explain against other exceptions, use ParseException.explain_exception().

  • Debug actions now take an added keyword argument cache_hit. Now that debug actions are called for expressions matched in the packrat parsing cache, debug actions are now called with this extra flag, set to True. For custom debug actions, it is necessary to add support for this new argument.

  • ZeroOrMore expressions that have results names will now include empty lists for their name if no matches are found. Previously, no named result would be present. Code that tested for the presence of any expressions using "if name in results:" will now always return True. This code will need to change to "if name in results and results[name]:" or just "if results[name]:". Also, any parser unit tests that check the as_dict() contents will now see additional entries for parsers having named ZeroOrMore expressions, whose values will be [].

  • ParserElement.set_default_whitespace_chars will now update whitespace characters on all built-in expressions defined in the pyparsing module.

  • camelCase names have been converted to PEP-8 snake_case names.

    Method names and arguments that were camel case (such as parseString) have been replaced with PEP-8 snake case versions (parse_string).

    Backward-compatibility synonyms for all names and arguments have been included, to allow parsers written using the old names to run without change. The synonyms will be removed in a future release. New parser code should be written using the new PEP-8 snake case names.

Name Previous name
ParserElement  
  • parse_string
parseString
  • scan_string
scanString
  • search_string
searchString
  • transform_string
transformString
  • add_condition
addCondition
  • add_parse_action
addParseAction
  • can_parse_next
canParseNext
  • default_name
defaultName
  • enable_left_recursion
enableLeftRecursion
  • enable_packrat
enablePackrat
  • ignore_whitespace
ignoreWhitespace
  • inline_literals_using
inlineLiteralsUsing
  • parse_file
parseFile
  • leave_whitespace
leaveWhitespace
  • parse_string
parseString
  • parse_with_tabs
parseWithTabs
  • reset_cache
resetCache
  • run_tests
runTests
  • scan_string
scanString
  • search_string
searchString
  • set_break
setBreak
  • set_debug
setDebug
  • set_debug_actions
setDebugActions
  • set_default_whitespace_chars
setDefaultWhitespaceChars
  • set_fail_action
setFailAction
  • set_name
setName
  • set_parse_action
setParseAction
  • set_results_name
setResultsName
  • set_whitespace_chars
setWhitespaceChars
  • transform_string
transformString
  • try_parse
tryParse
ParseResults  
  • as_list
asList
  • as_dict
asDict
  • get_name
getName
ParseBaseException  
  • parser_element
parserElement
any_open_tag anyOpenTag
any_close_tag anyCloseTag
c_style_comment cStyleComment
common_html_entity commonHTMLEntity
condition_as_parse_action conditionAsParseAction
counted_array countedArray
cpp_style_comment cppStyleComment
dbl_quoted_string dblQuotedString
dbl_slash_comment dblSlashComment
DelimitedList delimitedList
DelimitedList delimited_list
dict_of dictOf
html_comment htmlComment
infix_notation infixNotation
java_style_comment javaStyleComment
line_end lineEnd
line_start lineStart
make_html_tags makeHTMLTags
make_xml_tags makeXMLTags
match_only_at_col matchOnlyAtCol
match_previous_expr matchPreviousExpr
match_previous_literal matchPreviousLiteral
nested_expr nestedExpr
null_debug_action nullDebugAction
one_of oneOf
OpAssoc opAssoc
original_text_for originalTextFor
python_style_comment pythonStyleComment
quoted_string quotedString
remove_quotes removeQuotes
replace_html_entity replaceHTMLEntity
replace_with replaceWith
rest_of_line restOfLine
sgl_quoted_string sglQuotedString
string_end stringEnd
string_start stringStart
token_map tokenMap
trace_parse_action traceParseAction
unicode_string unicodeString
with_attribute withAttribute
with_class withClass

1.3   Discontinued Features

1.3.1   Python 2.x no longer supported

Removed Py2.x support and other deprecated features. Pyparsing now requires Python 3.6.8 or later. If you are using an earlier version of Python, you must use a Pyparsing 2.4.x version.

1.3.2   Other discontinued features

  • ParseResults.asXML() - if used for debugging, switch to using ParseResults.dump(); if used for data transfer, use ParseResults.as_dict() to convert to a nested Python dict, which can then be converted to XML or JSON or other transfer format
  • operatorPrecedence synonym for infixNotation - convert to calling infix_notation
  • commaSeparatedList - convert to using pyparsing_common.comma_separated_list
  • upcaseTokens and downcaseTokens - convert to using pyparsing_common.upcase_tokens and downcase_tokens
  • __compat__.collect_all_And_tokens will not be settable to False to revert to pre-2.3.1 results name behavior - review use of names for MatchFirst and Or expressions containing And expressions, as they will return the complete list of parsed tokens, not just the first one. Use pyparsing.enable_diag(pyparsing.Diagnostics.warn_multiple_tokens_in_named_alternation) to help identify those expressions in your parsers that will have changed as a result.
  • Removed support for running python setup.py test. The setuptools maintainers consider the test command deprecated (see <https://github.com/pypa/setuptools/issues/1684>). To run the Pyparsing tests, use the command tox.

1.4   Fixed Bugs

  • [Reverted in 3.0.2]Fixed issue when LineStart() expressions would match input text that was not necessarily at the beginning of a line.

    [The previous behavior was the correct behavior, since it represents the LineStart as its own matching expression. ParserElements that must start in column 1 can be wrapped in the new AtLineStart class.]

  • Fixed bug in regex definitions for real and sci_real expressions in pyparsing_common.

  • Fixed FutureWarning raised beginning in Python 3.7 for Regex expressions containing ‘[’ within a regex set.

  • Fixed bug in PrecededBy which caused infinite recursion.

  • Fixed bug in CloseMatch where end location was incorrectly computed; and updated partial_gene_match.py example.

  • Fixed bug in indentedBlock with a parser using two different types of nested indented blocks with different indent values, but sharing the same indent stack.

  • Fixed bug in Each when using Regex, when Regex expression would get parsed twice.

  • Fixed bugs in Each when passed OneOrMore or ZeroOrMore expressions: . first expression match could be enclosed in an extra nesting level . out-of-order expressions now handled correctly if mixed with required expressions . results names are maintained correctly for these expression

  • Fixed FutureWarning that sometimes is raised when '[' passed as a character to Word.

  • Fixed debug logging to show failure location after whitespace skipping.

  • Fixed ParseFatalExceptions failing to override normal exceptions or expression matches in MatchFirst expressions.

  • Fixed bug in which ParseResults replaces a collection type value with an invalid type annotation (as a result of changed behavior in Python 3.9).

  • Fixed bug in ParseResults when calling __getattr__ for special double-underscored methods. Now raises AttributeError for non-existent results when accessing a name starting with ‘__’.

  • Fixed bug in Located class when used with a results name.

  • Fixed bug in QuotedString class when the escaped quote string is not a repeated character.

1.5   Acknowledgments

And finally, many thanks to those who helped in the restructuring of the pyparsing code base as part of this release. Pyparsing now has more standard package structure, more standard unit tests, and more standard code formatting (using black). Special thanks to jdufresne, klahnakoski, mattcarmody, ckeygusuz, tmiguelt, and toonarmycaptain to name just a few.

Thanks also to Michael Milton and Max Fischer, who added some significant new features to pyparsing.