e12e 18 hours ago

Nice!

> However, for reasons unknown to me, they wrap these neatly separated rows with brackets ([ and ]) and add a comma to each line

Well, the reason (misguided or not) is as you say, I imagine:

> so it’s a valid JSON array containing 100+ million items.

> We are not going to attempt to load this massive array. Instead, we’re running this command:

    zcat ../latest-all.json.gz | sed 's/,$//' | split -l 100000 - wd_items_cw --filter='gzip > $FILE.gz'
That's one approach - I'm always a little wary of treating a rich format like JSON as <something>-delimited text - I'd be curious whether using jq in streaming mode is much different in run-time. I believe this snippet (the core of which we lifted from Stack Overflow or somewhere) does the same thing: split a valid JSON array into ndjson (with tweaks to hopefully generate similar splits):

    gunzip -c ../latest-all.json.gz \
      | jq -cn --stream \
        'fromstream(inputs | (.[0] |= .[1:]) | select(. != [[]]))' \
      | split -l 100000 - wd_items_cw --filter='gzip > $FILE.gz'
Note: on macOS zcat might not behave like gunzip -c, hence the change.
  • fiddlerwoaroof 14 hours ago

    It's too bad there aren't more streaming JSON parsers like oboe.js[1]. It would be nice if parsing libraries always supplied an event-based approach like this in addition to parsers that build up the entire data structure in memory.

    [1]: https://github.com/jimhigson/oboe.js

    EDIT: looking around a bit, I found json-stream ( https://github.com/dgraham/json-stream ) for Ruby.
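
    For reference, roughly what that event-based style looks like with json-stream (untested sketch; the callback names and the `<<` feeding are assumed from its README, and the depth counting is just an illustration for counting items in a huge gzipped top-level array):

        require 'zlib'
        require 'json/stream'

        depth   = 0
        objects = 0

        # Callbacks fire as bytes arrive; nothing beyond the current event is kept in memory.
        parser = JSON::Stream::Parser.new do
          start_object { depth += 1; objects += 1 if depth == 1 }  # depth 1 = an item of the top-level array
          end_object   { depth -= 1 }
        end

        Zlib::GzipReader.open('latest-all.json.gz') do |gz|
          while (chunk = gz.read(1 << 20))  # feed the parser ~1 MiB at a time
            parser << chunk
          end
        end

        puts objects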

    • e12e 4 hours ago

      I recently looked a little at streaming large JSON files in Ruby, but ran into some problems trying to stream from and to gzipped files by layering Ruby IO objects. In theory it should just be a matter of stacking streams, but in practice it was convoluted, a little brittle, and quite slow.
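
      For what it's worth, the basic stacking is simple enough on paper - something like this sketch (file names and the line filter are placeholders, and it assumes the input is already line-delimited):

          require 'zlib'

          # Gzipped text in, gzipped text out, one line at a time.
          Zlib::GzipReader.open('items.ndjson.gz') do |gz_in|
            Zlib::GzipWriter.open('filtered.ndjson.gz') do |gz_out|
              gz_in.each_line do |line|
                gz_out.write(line) if line.include?('"type":"item"')  # placeholder filter
              end
            end
          end

      It was going beyond this trivial case that turned out convoluted and slow in practice.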

Alifatisk a day ago

Interesting article. I think this is the first time I've seen someone pick Ractors over the Parallel gem, cool!

I love seeing these quick and dirty Ruby scripts used for data processing, filtering, or whatever; this is what Ruby is good at!

  • dbreunig a day ago

    Thanks! This is a near perfect use case for Ractors since we chunked all the files and there’s no need for the file processing function to share any context.
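
    Roughly the shape of that setup, for anyone curious (not the actual script - the chunk file glob, the per-item work, and spawning one Ractor per file are assumptions for illustration):

        require 'zlib'
        require 'json'

        # One Ractor per chunk file; each works on its own file and shares no state.
        files = Dir.glob('wd_items_cw*.gz')

        ractors = files.map do |path|
          Ractor.new(path) do |file|
            count = 0
            Zlib::GzipReader.open(file) do |gz|
              gz.each_line do |line|
                item = JSON.parse(line)
                count += 1 if item['type'] == 'item'  # placeholder per-item work
              end
            end
            count  # returned to the main Ractor below
          end
        end

        puts ractors.sum(&:take)  # blocks until every Ractor has finished

    A real run over a thousand-plus chunk files would presumably want a bounded pool of Ractors rather than one per file.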

nighthawk454 19 hours ago

Hey, cool article, thanks! Might be time to finally dive into DuckDB.

  • gnulinux 2 hours ago

    DuckDB is amazing. I've been using it over the last few weeks to analyze data I generate with Datalog/Soufflé, and I was completely blown away by the performance and QOL features. I seriously don't understand how it can be this fast...