Parsing text in Ruby, part 2
Parse and transform with regular expressions
May 1, 2023 · Felipe Vogel

- Why not Parslet?
- Overview of my parsing + transforming
- Separating parsing out from transforming
- Taming regular expressions
- Conclusion: a feeling of lightness
In the first part of this series I described an epiphany I had after discovering the Parslet gem: code for parsing text can become better organized if I split off data transformation as a second step after parsing.
In the end, though, a question remained. It was fun to use Parslet in a throwaway script, but was I going to use Parslet in a real-world scenario? (Namely, my Reading gem, which has a small but very dedicated user base of one, myself.)
In other words, would it be worthwhile to rip out my messy-but-working parsing code to replace it with Parslet? Or was there a more incremental approach where I could use some of my existing code while still reaping the benefits of separate parse and transform steps?
Why not Parslet?
I decided to rework my existing code rather than use Parslet, for a few reasons:
- My Reading gem doesn't depend on any other gems, and I wanted to keep it that way.
- Writing a parser with Parslet can be frustrating: if the input doesn't match the rules you've set up, parsing fails entirely, throwing an error instead of giving parsed output, and it's up to you to figure out why. With regular expressions, it's more common to get wrong output, which can give clues as to what went wrong.
- I had a feeling that I'd learn more by rolling my own solution. And I think I was right!
If you want to skip all the technical bits and go straight to my conclusion below, please feel free. I realize not everyone is as excited as I am about the minutiae of parsing a reading list 😄
Overview of my parsing + transforming
Here's an overview of my custom two-step parsing and transforming, copied from a comment in the top-level file of the gem:
# Architectural overview:
#
#             (CSV input)                     (Items)       (filtered Items)
#                  |                             Λ                  Λ
#                  |                             |   ·---.          |
#                  |                             |       |          |
#                  V                             |       V          |
#               ::parse                          |   ::filter       |
#                  |                             |       |          |
#                  |       .------------> Item --·       Filter ----·
#   Config,        |      /                /  \
#   errors.rb -- Parsing::CSV --·   Item::View  Item::TimeLength
#                    /  \
#       Parsing::Parser  Parsing::Transformer
#             |                  |
#      parsing/rows/*    parsing/attributes/*
#
In a nutshell, input from a CSV file is fed into Reading::parse, which passes it on to Parsing::CSV, where Parsing::Parser is used to produce an intermediate hash (with a structure mirroring the CSV columns), and then Parsing::Transformer is used to transform it into a final hash (with a structure based on item attributes rather than CSV columns). An array of these attribute-based hashes, each representing an item, is returned from Reading::parse.
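To make that flow concrete, here's a minimal usage sketch. (The path: keyword argument is from memory and may not match the gem's current API exactly; check the docs before copying.)

require "reading"

# Parse a CSV reading list into an array of item hashes.
items = Reading::parse(path: "reading.csv")

# Each element is an attribute-based hash representing one item.
items.first[:title]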
We don't need to worry about the right side of the diagram here, because that's what happens after parsing and transforming, to make their output more convenient for the user.
Now let's take a closer look at Parsing::Parser. This is the first half of the two-step parse-and-transform, and it's inspired by Parslet's Parser. I won't get into the second step, Parsing::Transformer, only because it's essentially my old code minus a bunch of strictly-parsing code that has been moved elsewhere, so that all that's left is tidying up the parser output. That's a big improvement, but there's not much else to say about it. So let's turn to the more interesting half of the equation: parsing.
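To give a feel for the division of labor, here is a hypothetical, simplified pair of hashes for a single row. The real gem's keys differ, but the shape of the two steps is the same.

# Parsing::Parser output: an intermediate hash mirroring the CSV columns.
parsed_row = {
  head: [{ title: "Dracula" }],
  genres: ["fiction", "horror"],
  length: { pages: "488" },
}

# Parsing::Transformer output: a final hash keyed by item attributes,
# with values tidied up (e.g. the page count converted to an Integer).
item_hash = {
  title: "Dracula",
  genres: ["fiction", "horror"],
  variants: [{ length: 488 }],
}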
Separating parsing out from transforming
Parsing::Parser is where I added most of the new code. To reiterate a point I made in my last post, the problem with my old parsing code was that it mixed parsing and transforming into one big muddled mess. For each item attribute (title, author, length, etc.), the parser reached into the CSV row (sometimes across multiple columns) to grab the relevant substrings and then processed them to get the desired output. What made the code so hard to understand is that the grabbing and the processing (i.e. the parsing and the transforming) happened all together.
For example, here is the old parsing code for the variants attribute of an item, which can represent different editions of a book, the audiobook vs. the ebook, and so on. The only reason that file is reasonably short is that most of the work is delegated out to four other files, each about as long as this one. Splitting up messy code into smaller chunks of messy code is better than nothing, but I still often struggled to understand some of this code that I myself had written, so I knew it needed more work.
Now that I've separated parsing out from transforming, that same file for the variants attribute has only 77 lines, as opposed to the 293 lines from before (counting the lines from the formerly required files). It's not only shorter but also easier to understand, since it doesn't mix finding CSV row substrings (parsing) with tidying them up (transforming).
Where did all those extra lines go? As I pulled out code that was strictly for parsing a CSV row, I noticed that my CSV columns share certain characteristics. I abstracted those into a Column class, which has a subclass for each of the columns. This way, instead of the parsing code being partially duplicated for each item attribute, all the parsing can happen in one place (in Parsing::Parser), with variations determined by the Column subclasses.
A simple example is the Genres column, whose main distinguishing characteristic is that itās a comma-separated list:
module Reading
  module Parsing
    module Rows
      module Regular
        # See https://github.com/fpsvogel/reading/blob/main/doc/csv-format.md#genres-column
        class Genres < Column
          def self.segment_separator
            /,\s*/
          end

          def self.regexes(segment_index)
            [%r{\A
              (?<genre>.+)
            \z}x]
          end
        end
      end
    end
  end
end
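To show how those two class methods come into play, here's a rough sketch of a driver in the style of Parsing::Parser. This is my own simplification for illustration, not the gem's actual implementation.

# Split a column's string into segments, then match each segment
# against the regexes defined for its position.
def parse_column(column_class, column_string)
  segments = column_string.split(column_class.segment_separator)

  segments.each_with_index.map do |segment, index|
    regex = column_class.regexes(index).find { |r| segment.match?(r) }
    raise "Unparsable segment: #{segment.inspect}" unless regex

    # The named captures become keys in the intermediate hash.
    segment.match(regex).named_captures.transform_keys(&:to_sym)
  end
end

parse_column(Reading::Parsing::Rows::Regular::Genres, "history, science")
# => [{ genre: "history" }, { genre: "science" }]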
At this point you may be thinking: Wait, this whole thing is built on regular expressions? 😱 Yes! Now I actually like regular expressions more than I did before, because this approach solved another problem with my old code: the unruliness of regular expressions.
Taming regular expressions
As you can see in the example above, each Column subclass contains all the regular expressions needed for that column's parsing. This is a big improvement over my previous approach of having regular expressions scattered throughout the parsing code and (better, but still confusing) in a config file.
What's more, I made use of the x (free-spacing) modifier to make regular expressions more readable with line breaks, indentation, and comments. I also stored reused parts of regular expressions in constants and then interpolated them, using the o modifier to force the interpolation to be done only once, for efficiency's sake.
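For example, here's a sketch of that pattern (the constant and method names are invented for illustration):

TIME_AMOUNT = %r{\d+:\d\d}

def length_regex
  # The trailing "o" makes Ruby interpolate TIME_AMOUNT only the first
  # time this literal is evaluated, even if the method is called often.
  %r{\A
    (
      (?<length_pages>\d+)p?          # a page count, e.g. "210p"
      |
      (?<length_time>#{TIME_AMOUNT})  # or a duration, e.g. "1:30"
    )
  \z}xo
end

length_regex.match("1:30")[:length_time]  # => "1:30"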
The extreme example of a long regular expression is in the History column, home of a regular expression that is a whopping 83 lines long, counting the lines from an interpolated regular expression.
If I sound boastful of my monstrously long regular expression, it's because I'm proud that it's still readable, at least to me. That was not the case with even much shorter one-line regular expressions in my old code, simply because a regular expression jammed onto a single line is hard to visually parse once it contains even two or three capturing groups or other parenthesized elements.
For example, here is one of the simpler regular expressions, the one for the Length column, jammed into one line:
/\A(((?<length_pages>\d+)p?|(?<length_time>\d+:\d\d))(\s+|\z))((?<each>each)|(x(?<repetitions>\d+)))?\z/
No mere mortal could read that regular expression. Breaking it up and adding comments makes a world of difference:
%r{\A
  # length
  (
    (
      (?<length_pages>\d+)p?
      |
      (?<length_time>\d+:\d\d)
    )
    (\s+|\z)
  )
  # each or repetitions, used in conjunction with the History column
  (
    # each
    (?<each>each)
    |
    # repetitions
    (
      x
      (?<repetitions>\d+)
    )
  )?
\z}x
Now we can see that it's made up of two halves: the length (in pages or time), and then either "each" or a number of repetitions. So it could look something like 200p or 1:30 each or 0:20 x14.
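A quick sanity check shows the named captures at work (assuming the regular expression above is assigned to a LENGTH_REGEX constant):

LENGTH_REGEX.match("200p")[:length_pages]      # => "200"
LENGTH_REGEX.match("1:30 each")[:length_time]  # => "1:30"
LENGTH_REGEX.match("1:30 each")[:each]         # => "each"
LENGTH_REGEX.match("0:20 x14")[:repetitions]   # => "14"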
You could argue that regular expressions don't look as clean as Parslet's DSL (here's an example of Parslet's DSL), but for me that's outweighed by how much more convenient regular expressions are, both because I already know them well and because they're easier to debug, especially with a regular expression sandbox like Rubular.
Conclusion: a feeling of lightness
I'm afraid this might be my most pedantic and tedious blog post to date, covering in great detail a very obscure project which, again, has a known user base of one. So I want to close by explaining why this refactoring was significant enough for me to wax eloquent about it in this post.
Until recently, my progress on my Reading gem was stalled. I had one more column that needed parsing code: the History column. But that column is complex, encompassing various ways of tracking your progress in a podcast, a book, or whatever else you read, listen to, or watch. Here's more on what the History column looks like, if you're curious, but my point is that I was paralyzed: I couldn't bring myself to implement History column parsing because of how messy and hard to follow the parsing code was for the simpler columns I'd already implemented. I shuddered to think of how long and unenjoyable it would be to implement the History column, and how impenetrable the code would be to me just a few days later.
And then I found Parslet, and it was like a beam of light breaking through dark clouds. (Cue uplifting choral music.) It showed me how to organize my parsing (and now transforming) code in a way that not only made me excited to implement that last column, but also lifted a burden whose weight I had felt every time I looked at my parsing code in my Reading gem. The code was a slog to read and to change, and I wasn't happy with it despite my best efforts to clean it up. Now, post-refactor, I actually enjoy re-reading my code, and it all feels a lot lighter even with that massive History column added in.
Thanks for reading and following along on my adventure. I hope it inspires you to write code that gives you that same feeling of lightness when you read and re-read it.