e12e 18 hours ago

Nice!

> However, for reasons unknown to me, they wrap these neatly separated rows with brackets ([ and ]) and add a comma to each line

Well, the reason (misguided or not) is as you say, I imagine:

> so it’s a valid JSON array containing 100+ million items.

> We are not going to attempt to load this massive array. Instead, we’re running this command:

    zcat ../latest-all.json.gz | sed 's/,$//' | split -l 100000 - wd_items_cw --filter='gzip > $FILE.gz'
That's one approach - I'm always a little wary of treating a rich format like JSON as <something>-delimited text - I'd be curious whether using jq in streaming mode is much different in run-time. I believe this snippet (the core of which we lifted from Stack Overflow or somewhere) does the same thing: split a valid JSON array into ndjson (with tweaks to hopefully generate similar splits):

    gunzip -c ../latest-all.json.gz \
      | jq -cn --stream \
        'fromstream(inputs | (.[0] |= .[1:]) | select(. != [[]]))' \
      | split -l 100000 - wd_items_cw --filter='gzip > $FILE.gz'
Note: on macOS zcat might not behave like gunzip -c, hence the change.
  • fiddlerwoaroof 14 hours ago

    It's too bad there aren't more streaming JSON parsers like oboe.js[1]. It would be nice if parsing libraries always supplied an event-based approach like this in addition to parsers that build up the entire data structure in memory.

    [1]: https://github.com/jimhigson/oboe.js

    EDIT: looking around a bit, I found json-stream ( https://github.com/dgraham/json-stream ) for Ruby.
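
    For reference, roughly what that event-based style looks like with json-stream (untested sketch; the callback names and the `<<` feeding are assumed from its README, and the depth counting is just an illustration for counting items in a huge gzipped top-level array):

        require 'zlib'
        require 'json/stream'

        depth   = 0
        objects = 0

        # Callbacks fire as bytes arrive; nothing beyond the current event is kept in memory.
        parser = JSON::Stream::Parser.new do
          start_object { depth += 1; objects += 1 if depth == 1 }  # depth 1 = an item of the top-level array
          end_object   { depth -= 1 }
        end

        Zlib::GzipReader.open('latest-all.json.gz') do |gz|
          while (chunk = gz.read(1 << 20))  # feed the parser ~1 MiB at a time
            parser << chunk
          end
        end

        puts objects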

    • e12e 4 hours ago

      I recently looked a little at streaming large JSON files in Ruby, but ran into some problems trying to stream from and to gzipped files by layering Ruby IO objects. In theory it should just be a matter of stacking streams, but in practice it was convoluted, a little brittle, and quite slow.
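
      For what it's worth, the basic stacking is simple enough on paper - something like this sketch (file names and the line filter are placeholders, and it assumes the input is already line-delimited):

          require 'zlib'

          # Gzipped text in, gzipped text out, one line at a time.
          Zlib::GzipReader.open('items.ndjson.gz') do |gz_in|
            Zlib::GzipWriter.open('filtered.ndjson.gz') do |gz_out|
              gz_in.each_line do |line|
                gz_out.write(line) if line.include?('"type":"item"')  # placeholder filter
              end
            end
          end

      It was going beyond this trivial case that turned out convoluted and slow in practice.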

Alifatisk a day ago

Interesting article. I think this is the first time I've seen someone pick Ractors over the Parallel gem, cool!

I love seeing these quick and dirty Ruby scripts used for data processing, filtering, or whatever; this is what Ruby is good at!

  • dbreunig a day ago

    Thanks! This is a near perfect use case for Ractors since we chunked all the files and there’s no need for the file processing function to share any context.
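
    Roughly the shape of that setup, for anyone curious (not the actual script - the chunk file glob, the per-item work, and spawning one Ractor per file are assumptions for illustration):

        require 'zlib'
        require 'json'

        # One Ractor per chunk file; each works on its own file and shares no state.
        files = Dir.glob('wd_items_cw*.gz')

        ractors = files.map do |path|
          Ractor.new(path) do |file|
            count = 0
            Zlib::GzipReader.open(file) do |gz|
              gz.each_line do |line|
                item = JSON.parse(line)
                count += 1 if item['type'] == 'item'  # placeholder per-item work
              end
            end
            count  # returned to the main Ractor below
          end
        end

        puts ractors.sum(&:take)  # blocks until every Ractor has finished

    A real run over a thousand-plus chunk files would presumably want a bounded pool of Ractors rather than one per file.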

nighthawk454 19 hours ago

Hey, cool article, thanks! Might be time to finally dive into DuckDB.

  • gnulinux 2 hours ago

    DuckDB is amazing. I've been using it over the last few weeks to analyze data I generate with Datalog/Soufflé, and I was completely blown away by the performance and QOL features. I seriously don't understand how it can be this fast...