~ 20 minutes.
This post talks about parsing, aka how to read file. It is focused on formats of interest for scientific / data processing, but is meant more as an introduction than a complete overview.
We will use Julia, Parsing.jl, and will be trying to parse the DOT and EdgeList formats.
This post draws from my experience rewriting a graph formats parsing library, LightGraphsIO, and is intended to be documentation / a tutorial for others / Parsers.jl who may venture into similar waters.
Special shout out to the LightGraphs.jl contributors and to Jacob Quinn, who kindly answered my questions about his Parsers.jl library.
For us, parsing involves reading data from a file. They usually have a specification for how the file should be structured.
Some programming tasks will require you to read data and not be stuck all day waiting for the data to load (as it was our case.). If it's a bottleneck for your work, consider diving in! Otherwise, its probably best to just call a library to handle it.
Quite fast, depending on your approach. The "simplest" (but by no means trivial) kind of parsing is byte-at-a-time. You look at one character, and depending on what you've seen, decide what to do, and push it into a relevant data structure and keep going. That approach hits a roof of about ( ) according to these calculations by Daniel Lemire, an absolute pro at this sort of stuff. His team's approach let's them parse some formats at a whopping ( ) by using SIMD vectorization (aka, oodles of parallelism, very close to assembly level). We may cover that approach in a later post, but for now we'll be happy with Parsers.jl that does some clever byte-at-a-time optimizations to still get respectable for our needs. This post however will cover some of Julia's multithreading capabilities. This will help us leverage the fact that modern computers aren't getting much faster, but there are more processors we can use.
We'll do some Memory mapping and using bitmasks. Don't worry, this all comes "out of the box" in Julia.
Briefly:
memory mapping : Convert a file to a vector of bytes. This will ideally greatly improved the processing speed of your file because we are using byte operations instead of string operations.
You can do this in Julia with a simple
julia> using Mmap
julia> mm = Mmap.mmap("file.txt")
bit masks : Instead of using several variables for many binary conditions, we stick them all in a single memory chunk and use bit operations to get out the useful information.
Which usually looks like
julia> 0b0001 & 1 # notice the 0b01 is a binary literal
1
If you're smart about it, you can use those resulting 1s for control branchless control flow, which also helps performance.
We'll get start with a EdgeList
(called "simple.edgelist"
) file, they look like this:
6 7
1 2
2 3
6 3
5 6
2 5
2 4
4 1
Where the first two digits correspond to n
edges and m
vertices, and the file has m
lines.
julia> using Mmap, Parsers
julia> const opts = Parsers.Options(delim = ' ', wh1 = 0x00)
julia> function read_two(file)
# 1. MMap the file
io = Mmap.mmap(file)
# 2. call xparse
pos = 1
x, code, vpos, vlen, tlen = Parsers.xparse(Int, io, pos, sizeof(io), opts)
# 3. Advance the cursor by the 'tab length'
pos += tlen
y, code, vpos, vlen, tlen = Parsers.xparse(Int, io, pos, sizeof(io), opts)
x, y
# 4. Oops! skipped error handling!
end
This is one of the "easier" cases that Parsers.jl can handle, so let's break down read_two("simple.edgelist")
:
julia> read_two("simple.edgelist", EdgeListFormat) == (6, 7)
true
Setup the appropriate Parsers.Options struct, which will dispatch on the parsing behavior. We're basically wrapping up all the presets for this file that are important: our delimiter is the ' '
whitespace character and the wh1
== ???
We use the Mmap
trick to convert our file into a stream of bytes to get a performance boost.
The argumnents to xparse
tell you where and what to start parsing Parsers.xparse(Int, io, pos, sizeof(io), opts)
:
Int
- Type of the value we are parsing
io
- handler to the io stream
pos
- to know where to start parsing
sizeof(io)
- to know if you have reached the end of io
opts
- to handle our parsing options.
We get a bunch of variables back from xparse
- the docstrings here are quite overwhelming but it's not too bad once we stare at it for a while:
x
is the value we parsed, 6
in this case.
code
is our bitmask to get back what actually happened in the parsing – particularly useful for error handling. This is will be done through bit operations
vpos, vlen, tlen
where exactly our "parsing cursor" is at. This book keeping to know where to start parsing next.
TODO Error handling
So that's our basic building block for how to get started with Parsers.jl.
Now let's try and write the rest of the code to read a whole file that contains a single graph:
julia> function load_graph(file_name, ::Type{EdgeListFormat}) # The funky second argument is so that we filter for EdgeListFormat behavior, not important here.
# yay code
end
julia> g = load_graph("simple.edgelist", EdgeListFormat)
{6, 7} undirected Int64 simple graph
Now, let's make it multithreaded with a test file of 100k rows, to see that we have something worthwhile.
We write our file like this:
julia> open("large.txt", "w+") do f
for i in 100_000
println(rand(1:100_000), ' ', rand(1:100_00))
end
end