Developers often think parser generators are the sole legit way to build programming language frontends, possibly because compiler courses in university teach lex/yacc variants. But do any modern programming languages actually use parser generators anymore?
To find out, this post presents a non-definitive survey of the parsing techniques used by various major programming language implementations.
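For readers who have only seen lex/yacc, here is a rough idea of what "handwritten" usually means in this context: a recursive descent parser with one function per grammar rule. The sketch below is a minimal illustration for arithmetic expressions, not code from any of the implementations surveyed here. The names `tokenize`, `Parser`, and `parse` are made up for this example, and the tuple-based syntax tree is an assumption chosen to keep it short.

```python
import re

# One token per match: either a run of digits or any single character.
TOKEN_RE = re.compile(r"\s*(?:(\d+)|(.))")


def tokenize(text):
    """Split text into ("num", int) and ("op", char) tokens, skipping whitespace."""
    tokens = []
    for number, op in TOKEN_RE.findall(text):
        if number:
            tokens.append(("num", int(number)))
        elif op.strip():
            tokens.append(("op", op))
    return tokens


class Parser:
    """Recursive descent: each grammar rule below is one method."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("eof", None)

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expect_op(self, op):
        tok = self.advance()
        if tok != ("op", op):
            raise SyntaxError(f"expected {op!r}, got {tok!r}")

    # expr := term (("+" | "-") term)*
    def expr(self):
        node = self.term()
        while self.peek() in (("op", "+"), ("op", "-")):
            _, op = self.advance()
            node = (op, node, self.term())
        return node

    # term := factor (("*" | "/") factor)*
    def term(self):
        node = self.factor()
        while self.peek() in (("op", "*"), ("op", "/")):
            _, op = self.advance()
            node = (op, node, self.factor())
        return node

    # factor := NUMBER | "(" expr ")"
    def factor(self):
        kind, value = self.advance()
        if kind == "num":
            return value
        if (kind, value) == ("op", "("):
            node = self.expr()
            self.expect_op(")")
            return node
        raise SyntaxError(f"unexpected token {(kind, value)!r}")


def parse(text):
    """Parse an arithmetic expression into nested (op, left, right) tuples."""
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek()[0] != "eof":
        raise SyntaxError(f"trailing input at token {parser.peek()!r}")
    return tree
```

Calling `parse("1 + 2 * 3")` yields a nested tuple in which the multiplication binds tighter than the addition, because `term` sits below `expr` in the call chain. Production parsers add source positions, error recovery, and far larger grammars, but the overall shape is often similar.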
Until CPython 3.10 (which hasn't been released yet), the default parser was built using pgen, a custom parser generator. The team thought a PEG parser was a better fit for expressing the language. At the time, the switch from pgen to the PEG parser improved speed by 10% but also increased memory usage by 10%.
The PEG grammar is defined here. (It is being renamed in 3.10, though, so check the directory for a similarly named file if you browse 3.10+.)
This section was corrected by MegaIng on Reddit. Originally I mistakenly claimed the previous parser was handwritten. It was not.
Thanks J. Ryan Stinnett for a correction about the change in speed in the new PEG parser.
Source code for the C parser is available here. It used Bison until GCC 4.1 in 2006. The C++ parser had already switched from Bison to a handwritten parser 2 years earlier.
Not only handwritten but the same file handles parsing C, Objective-C and C++. Source code is available here.
Ruby uses Bison. The grammar for the language can be found here.
Source code available here.
Source code available here.
Source code available here.
Source code for the grammar is available here.
Source code available here.
You can find the source code here.
Some older commentary calls this implementation fragile. But a Java contributor suggests the situation has improved since Java 8.
Until Go 1.6 the compiler used a yacc-based parser. The source code for that grammar is available here.
In Go 1.6 they switched to a handwritten parser. You can find that change here. The switch reportedly made parsing files 18% faster and building the compiler itself 3% faster.
You can find the source code for the compiler's parser here.
The C# parser source code is available here. The Visual Basic parser source code is here.
A C# contributor mentioned a few key reasons for using a handwritten parser here.
Source code available here.
Source code available here.
I couldn't find it at first but Liorithiel showed me the parser source code is here.
Julia's parser is handwritten but not in Julia. It's in Scheme! Source code available here.
PostgreSQL uses Bison for parsing queries. Source code for the grammar available here.
Source code for the grammar available here.
SQLite uses its own parser generator called Lemon. Source code for the grammar is available here.
Of the 2021 Redmonk top 10 languages, 8 of them have a handwritten parser. Ruby and Python use parser generators.
Although parser generators are still used in major language implementations, maybe it's time for universities to start teaching handwritten parsing?
This tweet was published before I was corrected about Python's parser. It should say 8/10 but I cannot edit the tweet.
Let's actually survey the parsing techniques used by major programming languages in 2021 (with links to code 👾).
In this post we discover that 9/10 of the top languages by @redmonk use a handwritten parser as opposed to a parser generator. 😱 https://t.co/M69TqN78G5 pic.twitter.com/sGsdDmwshB
— Phil Eaton (@phil_eaton) August 21, 2021