How I parsed huge JSON files into an SQLite Database in under a second using the power of Crystal Language

Recently I worked on a project with my friend David Colombo, he needed to take huge JSON files, map them onto an object, and then copy the data over from that object and insert it into a SQLite Database. This post describes the technical challenge of taking large amounts of JSON Data and inserting it into a SQLite Database without wait times.

Background

Previously, David had created a parser in NodeJS that reads through the JSON files containing structured data and inserts the data into SQLite. These JSON Files contained 1,000s of keys, and are 100s of megabytes each. Unfortunately due to limitations of the language and some anti-practices that JavaScript allows, the script took 14 hours to complete it’s task. He attempted to write a parser in Python, except it was limited to 9999 keys and couldn’t meet the project’s expectations (in retrospect I’m sure there’s a library or way to get around that limit, I’m not a Python Developer and cannot comment much on this potential limitation). He sent me a message asking if I was still working with Ruby and asked if it’d be any faster, I explained while I’m able to write a script in Ruby just fine, I had been experimenting with a language called Crystal (a language with Ruby-like syntax and C-like performance) and asked if he’d be open to trying it instead of Ruby. I also wasn’t able to predict the performance of Ruby ahead of time and was not prepared to provide an answer on whether Ruby would be faster. Since I’ve been learning Crystal I decided to give the rewrite in it a shot.

A few requirements

We want the code to be long lived and require minimal changes (preferably no changes) to the project’s dependencies, and we also wanted to avoid using third party libraries (since we risk having to replace them in the event the maintainers end the project) and only using Crystal’s standard library was a goal. The standard library does not support databases, so we decided to use the official shard for SQLite, it’s maintained by the Crystal Core Team so it will probably be maintained well enough.

An object to represent the JSON Files

To maximize performance I decided to use a nested struct to contain the data using Crystal’s JSON.mapping() (https://crystal-lang.org/api/0.28.0/JSON.html#mapping). The data wouldn’t be changed, just copied into SQLite and stack memory is cheaper than heap memory so the drawbacks of structs were worth it in exchange for performance benefits.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
module MyProgram
extend self

struct CVE\_Data\_Entity
struct CVE_Items
struct CVE
struct DataMeta
JSON.mapping(
"id": {key: "ID", type: String, nilable: true},
"assigner": {key: "ASSIGNER", type: String, nilable: true}
)
end

struct Affects
struct Vendor
struct VendorData
struct Product
struct Data
struct Version
struct Data
JSON.mapping(
"version_value": {type: String, nilable: true},
"version_affected": {type: String, nilable: true},
)
end

JSON.mapping(
"version_data": {type: Array(Data), nilable: true},
)
end

JSON.mapping(
"product_name": {type: String, nilable: true},
"version": {type: Version, nilable: true}
)
end

JSON.mapping(
"product_data": {type: Array(Data), nilable: true},
)
end

JSON.mapping(
"vendor_name": {type: String, nilable: true},
"product": {type: Product, nilable: true},
)
end

JSON.mapping(
"vendor_data": {type: Array(VendorData), nilable: true},
)
end

JSON.mapping(
"vendor": {type: Vendor, nilable: true},
)
end

struct Problemtype
struct Data
struct Description
JSON.mapping(
"lang": {type: String, nilable: true},
"value": {type: String, nilable: true}
)
end

JSON.mapping(
"description": {type: Array(Description), nilable: true}
)
end

JSON.mapping(
"problemtype_data": {type: Array(Data), nilable: true},
)
end

struct References
struct Data
JSON.mapping(
"url": {type: String, nilable: true},
"name": {type: String, nilable: true},
"refsource": {type: String, nilable: true},
"tags": {type: Array(String), nilable: true},
)
end

JSON.mapping(
"reference_data": {type: Array(Data), nilable: true},
)
end

struct Description
struct Data
JSON.mapping(
"lang": {type: String, nilable: true},
"value": {type: String, nilable: true},
)
end

JSON.mapping(
"description_data": {type: Array(Data), nilable: true}
)
end

JSON.mapping(
"data_type": {type: String, nilable: true},
"data_format": {type: String, nilable: true},
"data_version": {type: String, nilable: true},
"cve\_data\_meta": {key: "CVE\_data\_meta", type: DataMeta, nilable: true},
"affects": {type: Affects, nilable: true},
"problemtype": {type: Problemtype, nilable: true},
"references": {type: References, nilable: true},
"description": {type: Description, nilable: true},
)
end

struct Configurations
struct Nodes
struct CPE
JSON.mapping(
"vulnerable": {type: Bool, nilable: true},
"cpe23Uri": {type: String, nilable: true},
)
end

JSON.mapping(
"operator": {type: String, nilable: true},
"cpe_match": {type: Array(CPE), nilable: true},
)
end

JSON.mapping(
"cve\_data\_version": {key: "CVE\_data\_version", type: String, nilable: true},
"nodes": {type: Array(Nodes), nilable: true},
)
end

struct Impact
struct BaseMetricV3
struct CvssV3
JSON.mapping(
"version": {type: String, nilable: true},
"vectorString": {type: String, nilable: true},
"attackVector": {type: String, nilable: true},
"attackComplexity": {type: String, nilable: true},
"privilegesRequired": {type: String, nilable: true},
"userInteraction": {type: String, nilable: true},
"scope": {type: String, nilable: true},
"confidentialityImpact": {type: String, nilable: true},
"integrityImpact": {type: String, nilable: true},
"availabilityImpact": {type: String, nilable: true},
"baseScore": {type: Float64, nilable: true},
"baseSeverity": {type: String, nilable: true},
)
end

JSON.mapping(
"cvssV3": {type: CvssV3, nilable: true},
"exploitabilityScore": {type: Float64, nilable: true},
"impactScore": {type: Float64, nilable: true},
)
end

struct BaseMetricV2
struct CvssV2
JSON.mapping(
"version": {type: String, nilable: true},
"vectorString": {type: String, nilable: true},
"accessVector": {type: String, nilable: true},
"accessComplexity": {type: String, nilable: true},
"authentication": {type: String, nilable: true},
"confidentialityImpact": {type: String, nilable: true},
"integrityImpact": {type: String, nilable: true},
"availabilityImpact": {type: String, nilable: true},
"baseScore": {type: Float64, nilable: true},
)
end

JSON.mapping(
"cvssV2": {type: CvssV2, nilable: true},
"severity": {type: String, nilable: true},
"exploitabilityScore": {type: Float64, nilable: true},
"impactScore": {type: Float64, nilable: true},
"acInsufInfo": {type: Bool, nilable: true},
"obtainAllPrivilege": {type: Bool, nilable: true},
"obtainUserPrivilege": {type: Bool, nilable: true},
"obtainOtherPrivilege": {type: Bool, nilable: true},
"userInteractionRequired": {type: Bool, nilable: true},
)
end

JSON.mapping(
"baseMetricV2": {type: BaseMetricV2, nilable: true},
"baseMetricV3": {type: BaseMetricV3, nilable: true},
)
end

JSON.mapping(
"cve": {type: CVE, nilable: true},
"configurations": {type: Configurations, nilable: true},
"impact": {type: Impact, nilable: true},
"publishedDate": {type: String, nilable: true},
"lastModifiedDate": {type: String, nilable: true},
)
end

JSON.mapping(
"cve\_data\_type": {key: "CVE\_data\_type", type: String, nilable: true},
"cve\_data\_format": {key: "CVE\_data\_format", type: String, nilable: true},
"cve\_data\_version": {key: "CVE\_data\_version", type: String, nilable: true},
"cve\_data\_numberofcves": {key: "CVE\_data\_numberOfCVEs", type: String, nilable: true},
"cve\_data\_timestamp": {key: "CVE\_data\_timestamp", type: String, nilable: true},
"cve\_items": {key: "CVE\_Items", type: Array(CVE_Items), nilable: true},
)
end
end

Improving the JSON parsing time

In Crystal’s development mode parsing one of the JSON files (2018.json containing around 200MB of data) into the object took 30 seconds, in release mode it took 5 seconds. This performance was pretty good already but I’d like it to be faster as the datasets would get larger and larger over time. The first thing I tried was changing the class to a struct (which is being used now). That had minimal impact on performance. Next I changed from using the File.open() method to the File.read() method which improved the file read speed and brought the parse time down to under a second. From this we learned that it’s much faster to open a file in read mode, than in read and write mode. When writing code we know to only ask for read permissions except when we also need to write to it. There are probably more file optimizations we could try, although that’s a topic of it’s own.

Inserting the data into SQLite

Gathering the data and attaching it into an object was only half the challenge, next I needed an efficient way to massively insert data into SQLite. At first I tried iterating over the various arrays and doing a lot of individual queries, on my Mac that still took around five minutes and much longer (never actually completed) on a Linux Laptop. I then learned I could group these queries into one bulk transaction. I came up with the following code that ran in under a second.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
require "json"
require "sqlite3"
require "./cve\_data\_entity.cr"

module MyProgram
VERSION = "0.1.0"

filepath = "./src/example-json-files/example-full-2019-dataset.json"
myobject = CVE\_Data\_Entity.from_json(File.read(filepath))

DB.open "sqlite3://./src/example-json-files/dbname.sqlt" do |db|
db.transaction do |tx|
tx.begin_transaction
myobject.try(&.cve_items).try(&.each do |item|
# Insert General Information into the Database
cve\_item\_id = item.try(&.cve).try(&.cve\_data\_meta).try(&.id) || "Not available"
data\_type = item.try(&.cve).try(&.data\_type) || "Not available"
data\_format = item.try(&.cve).try(&.data\_format) || "Not available"
data\_version = item.try(&.cve).try(&.data\_version) || "Not available"
published_date = item.try(&.publishedDate) || "Not available"
last\_modified\_date = item.try(&.lastModifiedDate) || "Not available"
tx.connection.exec("INSERT INTO GENERAL\_INFORMATION (\\"ID\\", \\"DATA\_TYPE\\", \\"DATA\_FORMAT\\", \\"DATA\_VERSION\\", \\"PUBLISHDATE\\", \\"LASTMODIFIEDDATE\\") VALUES (?, ?, ?, ?, ?, ?)", \[cve\_item\_id, data\_type, data\_format, data\_version, published\_date, last\_modified\_date\])

end)
tx.commit
end
end
end

Admittedly the .try() method calls can be a bit messy and we’re looking into cleaner ways to write this type of code. One recommendation was to port the JSON .dig() method to my struct, in the future I might attempt that. If its difficult reading through the try logic right now, read about capturing blocks and procs in the Crystal Documentation first and it’ll make more sense. This method is not ideal when working with larger amounts of code. Although other than the readability issues, there was not a huge performance impact.

In conclusion

By writing our parser and database insertion logic in Crystal we were forced to use better coding practices, we learned about SQLite transactions, and saved about 14 hours on our database’s build time and brought it down to under a second. If you have a similar challenge in your organization, consider trying to solve it using Crystal.