Parsing basics - how to use Parsers.jl to parse formats for high performance

  1. Parsing basics - how to use Parsers.jl to parse formats for high performance
  2. What is parsing?
  3. Why should I care?
  4. How fast can you go?
  5. Oooh - parsing tricks!
  6. The format to be parsed
      1. TODO ???
      2. Success! 🔥

~ 20 minutes.

What is parsing?

For us, parsing involves reading data from a file. They usually have a specification for how the file should be structured.

Why should I care?

Some programming tasks will require you to read data and not be stuck all day waiting for the data to load (as it was our case.). If it's a bottleneck for your work, consider diving in! Otherwise, its probably best to just call a library to handle it.

How fast can you go?

Quite fast, depending on your approach. The "simplest" (but by no means trivial) kind of parsing is byte-at-a-time. You look at one character, and depending on what you've seen, decide what to do, and push it into a relevant data structure and keep going. That approach hits a roof of about ( ) according to these calculations by Daniel Lemire, an absolute pro at this sort of stuff. His team's approach let's them parse some formats at a whopping ( ) by using SIMD vectorization (aka, oodles of parallelism, very close to assembly level). We may cover that approach in a later post, but for now we'll be happy with Parsers.jl that does some clever byte-at-a-time optimizations to still get respectable for our needs. This post however will cover some of Julia's multithreading capabilities. This will help us leverage the fact that modern computers aren't getting much faster, but there are more processors we can use.

Oooh - parsing tricks!

We'll do some Memory mapping and using bitmasks. Don't worry, this all comes "out of the box" in Julia.


You can do this in Julia with a simple

julia> using Mmap
julia> mm = Mmap.mmap("file.txt")

Which usually looks like

julia> 0b0001 & 1 # notice the 0b01 is a binary literal

If you're smart about it, you can use those resulting 1s for control branchless control flow, which also helps performance.

The format to be parsed

We'll get start with a EdgeList (called "simple.edgelist") file, they look like this:

6 7
1 2
2 3
6 3
5 6
2 5
2 4
4 1

Where the first two digits correspond to n edges and m vertices, and the file has m lines.

julia> using Mmap, Parsers

julia> const opts = Parsers.Options(delim = ' ', wh1 = 0x00)

julia> function read_two(file)
            # 1. MMap the file 
            io = Mmap.mmap(file)

            # 2. call xparse
            pos = 1
            x, code, vpos, vlen, tlen = Parsers.xparse(Int, io, pos, sizeof(io), opts)

            # 3. Advance the cursor by the 'tab length'
            pos += tlen
            y, code, vpos, vlen, tlen = Parsers.xparse(Int, io, pos, sizeof(io), opts)
            x, y
            # 4. Oops! skipped error handling!

This is one of the "easier" cases that Parsers.jl can handle, so let's break down read_two("simple.edgelist"):

julia> read_two("simple.edgelist", EdgeListFormat) == (6, 7)
TODO ???
  1. Setup the appropriate Parsers.Options struct, which will dispatch on the parsing behavior. We're basically wrapping up all the presets for this file that are important: our delimiter is the ' ' whitespace character and the wh1 == ???

  2. We use the Mmap trick to convert our file into a stream of bytes to get a performance boost.

  3. The argumnents to xparse tell you where and what to start parsing Parsers.xparse(Int, io, pos, sizeof(io), opts) :

    • Int - Type of the value we are parsing

    • io - handler to the io stream

    • pos - to know where to start parsing

    • sizeof(io) - to know if you have reached the end of io

    • opts - to handle our parsing options.

  4. We get a bunch of variables back from xparse - the docstrings here are quite overwhelming but it's not too bad once we stare at it for a while:

    • x is the value we parsed, 6 in this case.

    • code is our bitmask to get back what actually happened in the parsing – particularly useful for error handling. This is will be done through bit operations

    • vpos, vlen, tlen where exactly our "parsing cursor" is at. This book keeping to know where to start parsing next.

  5. TODO Error handling

So that's our basic building block for how to get started with Parsers.jl.

Now let's try and write the rest of the code to read a whole file that contains a single graph:

julia> function load_graph(file_name, ::Type{EdgeListFormat}) # The funky second argument is so that we filter for EdgeListFormat behavior, not important here.
    # yay code

julia> g = load_graph("simple.edgelist", EdgeListFormat)
{6, 7} undirected Int64 simple graph
Success! 🔥

Now, let's make it multithreaded with a test file of 100k rows, to see that we have something worthwhile.

We write our file like this:

julia> open("large.txt", "w+") do f
	for i in 100_000
		println(rand(1:100_000), ' ', rand(1:100_00))