JSON Lines is a convenient format for storing structured data that can be processed one record at a time. It consists of lines in which each line is a valid JSON value (most often an object), separated by the newline character \n.
Example of JSON Lines format:
{"class":1,"teacher":"Mark","student":["Tom"]}
{"class":2,"teacher":"John","student":["Jessi","Antony","Jack"]}
{"class":3,"teacher":"Bob","student":["Jerry","Karol"]}
Every entry in JSON Lines is valid JSON on its own. It can be parsed (unmarshaled) as a standalone JSON document.
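As a minimal sketch of what "standalone" means in practice, the snippet below parses JSON Lines text one line at a time with JSON.parse. The function name parseJsonLines is hypothetical, and skipping blank lines is an assumption made so that a trailing newline does not cause an error.

```javascript
// Parse JSON Lines text into an array of values, one per non-empty line.
function parseJsonLines(text) {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '') // assumption: ignore blank lines
    .map((line) => JSON.parse(line));     // each line is its own JSON document
}

const text =
  '{"class":1,"teacher":"Mark","student":["Tom"]}\n' +
  '{"class":2,"teacher":"John","student":["Jessi","Antony","Jack"]}\n' +
  '{"class":3,"teacher":"Bob","student":["Jerry","Karol"]}\n';

const records = parseJsonLines(text);
console.log(records.length);     // 3
console.log(records[1].teacher); // John
```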
Advantages of JSON Lines:
- There is no need to read the entire file into memory before parsing it.
- New records can be added simply by appending lines to the file.
Three requirements of JSON Lines:
- UTF-8 encoding
- Each line is a valid JSON value
- The line separator is \n
What kinds of validation are required after uploading a JSON Lines file?
- Check that the uploaded file has the file extension .jsonl
- Check that no line exceeds 2 MB (my project requirement)
- Check that each line conforms to the provided JSON schema (my project requirement)
- Check that the file is UTF-8 encoded
- Check that files of any length are supported (my project requirement)
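The two simplest checks in the list above, the file extension and the per-line size limit, could look roughly like this. All function names are hypothetical, and treating the extension as case-insensitive is an assumption. Note that the size is measured in UTF-8 bytes, not in characters.

```javascript
const MAX_LINE_SIZE = 2 * 2 ** 20; // 2 MB, the per-line limit from the list above

// True when the file name ends with ".jsonl" (case-insensitive, an assumption).
function hasJsonlExtension(fileName) {
  return fileName.toLowerCase().endsWith('.jsonl');
}

// Size of a line in bytes once encoded as UTF-8.
function lineByteLength(line) {
  return new TextEncoder().encode(line).length;
}

// True when the line fits within the byte limit.
function isLineWithinLimit(line, maxBytes = MAX_LINE_SIZE) {
  return lineByteLength(line) <= maxBytes;
}

console.log(hasJsonlExtension('classes.jsonl')); // true
console.log(lineByteLength('é'));                // 2
```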
Validation process
Read the uploaded file
↓
Define parameters
- cursor = 0: the beginning of each chunk
- max_line_size = 2*2**20: the maximum size of each line, 2 MB (1 MB = 2^20 bytes in binary)
- chunk_size = 3*2**20: the amount of data loaded each time, 3 MB
- cursor_end = cursor + chunk_size: the end of each chunk
- is_last_chunk = cursor_end >= file.size: whether the end of the file has been reached
- chunk = file.slice(cursor, cursor_end): one chunk of file data
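The parameters above can be bundled into a small helper that computes the byte range of the next chunk. This is only a sketch: chunkBounds is a hypothetical name, and clamping cursorEnd to the file size is an added assumption that keeps the range inside the file.

```javascript
const CHUNK_SIZE = 3 * 2 ** 20; // 3 MB loaded per read

// Compute the byte range of the chunk that starts at `cursor`.
function chunkBounds(cursor, fileSize, chunkSize = CHUNK_SIZE) {
  const cursorEnd = Math.min(cursor + chunkSize, fileSize); // never read past the end
  const isLastChunk = cursorEnd >= fileSize;
  return { cursor, cursorEnd, isLastChunk };
}

console.log(chunkBounds(0, 10, 4)); // { cursor: 0, cursorEnd: 4, isLastChunk: false }
console.log(chunkBounds(8, 10, 4)); // { cursor: 8, cursorEnd: 10, isLastChunk: true }
```

In a browser, the resulting range would then be passed to file.slice(cursor, cursorEnd) on the uploaded File object.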
↓
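If this is not the last chunk, the chunk may end in the middle of a multi-byte UTF-8 character, so remove the trailing bytes of that incomplete character before decoding.
UTF-8 encodes each code point in one to four bytes: a lead byte (0xxxxxxx, 110xxxxx, 1110xxxx, or 11110xxx) followed by zero to three continuation bytes (10xxxxxx). The x characters are replaced by the bits of the code point.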
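One way to find out how many trailing bytes belong to a cut-off character is to walk back from the end of the chunk to the nearest lead byte and compare the bytes present with the length that lead byte announces. The sketch below does this on a Uint8Array. Both function names are hypothetical, and using a fatal TextDecoder (which throws on invalid UTF-8) to double as the encoding check is an assumption, not something the process above prescribes.

```javascript
// Number of bytes at the end of `bytes` that form an incomplete UTF-8 character.
function incompleteTailLength(bytes) {
  // Walk back over at most 3 continuation bytes (10xxxxxx) to reach the lead byte.
  let i = bytes.length - 1;
  let continuation = 0;
  while (i >= 0 && continuation < 3 && (bytes[i] & 0xc0) === 0x80) {
    i--;
    continuation++;
  }
  if (i < 0) return 0;
  const lead = bytes[i];
  let expected; // total length announced by the lead byte
  if ((lead & 0x80) === 0x00) expected = 1;
  else if ((lead & 0xe0) === 0xc0) expected = 2;
  else if ((lead & 0xf0) === 0xe0) expected = 3;
  else if ((lead & 0xf8) === 0xf0) expected = 4;
  else return 0; // not a lead byte; leave it for the decoder to report
  const present = continuation + 1;
  return present < expected ? present : 0;
}

// Decode a chunk, dropping a cut-off trailing character unless it is the last chunk.
function decodeChunk(bytes, isLastChunk) {
  const usable = isLastChunk
    ? bytes
    : bytes.subarray(0, bytes.length - incompleteTailLength(bytes));
  return new TextDecoder('utf-8', { fatal: true }).decode(usable);
}

// "é" is 0xC3 0xA9; cutting after 0xC3 leaves one incomplete byte.
console.log(incompleteTailLength(new Uint8Array([0x61, 0xc3]))); // 1
console.log(decodeChunk(new Uint8Array([0x61, 0xc3]), false));   // a
```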
↓
Split the file data into an array of lines by \n
lines = file_text.split('\n')
If the last line does not end with \n, it is not a completed line
completed_line = is_last_chunk ? lines : lines.slice(0, -1)
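The splitting step can be written as a small function. This sketch follows the expression above. The one addition, labeled as an assumption, is that a final empty element is dropped in the last chunk, because a file ending in \n would otherwise yield an empty "line".

```javascript
// Return only the lines of a chunk that are known to be complete.
function completedLines(chunkText, isLastChunk) {
  const lines = chunkText.split('\n');
  // In a non-last chunk, the final element is either a partial line
  // or '' (when the chunk ends exactly on "\n"); drop it in both cases.
  if (!isLastChunk) return lines.slice(0, -1);
  // Assumption: a trailing newline at end of file is not an extra line.
  if (lines[lines.length - 1] === '') lines.pop();
  return lines;
}

console.log(completedLines('{"a":1}\n{"b":', false));  // [ '{"a":1}' ]
console.log(completedLines('{"a":1}\n{"b":2}', true)); // [ '{"a":1}', '{"b":2}' ]
```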
↓
Read the first line of the first chunk
↓
Validate that each line follows the rules of the provided JSON schema
Taking the example of JSON Lines format above: each line must contain the keys “class”, “teacher”, and “student”
Use Ajv as the JSON Schema validator
Show on the UI which line violates the rules and what the error is
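A rough sketch of the per-line check follows. The schema classSchema is an assumed schema written for the example data, not one taken from the project. The helper validateLine is hypothetical and accepts any validator shaped like the function Ajv's compile returns: it returns a boolean and exposes the failures on its errors property. The instancePath field used in the message is the name Ajv version 8 uses.

```javascript
// Assumed JSON Schema for one line of the example data.
const classSchema = {
  type: 'object',
  required: ['class', 'teacher', 'student'],
  properties: {
    class: { type: 'integer' },
    teacher: { type: 'string' },
    student: { type: 'array', items: { type: 'string' } },
  },
};

// With Ajv installed, a validator could be created like this:
//   const Ajv = require('ajv');
//   const validate = new Ajv().compile(classSchema);

// Validate one line. Returns null when valid, otherwise { line, message }.
function validateLine(lineText, lineNumber, validate) {
  let data;
  try {
    data = JSON.parse(lineText);
  } catch (err) {
    return { line: lineNumber, message: `Invalid JSON: ${err.message}` };
  }
  if (validate(data)) return null;
  const message = (validate.errors || [])
    .map((e) => `${e.instancePath || '(root)'} ${e.message}`)
    .join('; ');
  return { line: lineNumber, message };
}
```

The returned line number and message are what the UI would display for each failing line.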
↓
While validating, collect the validation errors
If the number of validation errors exceeds max_n_json_error (let’s say 500), abort the reading. This is essential when reading very large files.
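The error cap might be implemented along these lines. The function collectErrors and its checkLine callback are hypothetical names, and stopping as soon as the count reaches the cap (rather than one past it) is an assumption about the exact boundary.

```javascript
const MAX_N_JSON_ERROR = 500;

// Run `checkLine(line, lineNumber)` on each line and collect the errors it returns.
// `checkLine` returns null for a valid line, or an error value otherwise.
function collectErrors(lines, checkLine, maxErrors = MAX_N_JSON_ERROR) {
  const errors = [];
  for (let i = 0; i < lines.length; i++) {
    const error = checkLine(lines[i], i + 1);
    if (error) errors.push(error);
    // Assumption: abort as soon as the cap is reached.
    if (errors.length >= maxErrors) return { errors, aborted: true };
  }
  return { errors, aborted: false };
}

// Example: every line is invalid JSON, and the cap is 3.
const badLines = Array(10).fill('not json');
const outcome = collectErrors(
  badLines,
  (line, n) => {
    try { JSON.parse(line); return null; } catch (err) { return { line: n, message: 'invalid JSON' }; }
  },
  3
);
console.log(outcome.errors.length, outcome.aborted); // 3 true
```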
↓
If this line has no validation error, read the next line
↓
If the lines in this chunk have no validation errors, read the next chunk
↓
If every chunk of the file has been validated without errors, execute the next action
Summary of the process
file uploaded → start validation → read each chunk → validate each line → validation done → next action
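Putting the steps together, the whole loop could be sketched as below. This is an illustration under several assumptions, not the project's implementation. It works on a Uint8Array standing in for the uploaded file, and validateJsonLines, checkLine, and the option names are hypothetical. It also swaps in a simplification: it splits on the newline byte 0x0A before decoding, so no character is ever cut and the trimming step is not needed. The cursor advances by the number of bytes covered by completed lines, so a partial last line is read again with the next chunk.

```javascript
// Validate JSON Lines held in a Uint8Array, chunk by chunk.
// `checkLine(text, lineNumber)` returns an error message, or null when the line is valid.
function validateJsonLines(bytes, checkLine, options = {}) {
  const { chunkSize = 3 * 2 ** 20, maxLineSize = 2 * 2 ** 20, maxErrors = 500 } = options;
  const decoder = new TextDecoder('utf-8', { fatal: true }); // throws on invalid UTF-8
  const errors = [];
  let cursor = 0; // byte offset where the next chunk starts
  let lineNumber = 0;

  while (cursor < bytes.length && errors.length < maxErrors) {
    const cursorEnd = Math.min(cursor + chunkSize, bytes.length);
    const isLastChunk = cursorEnd >= bytes.length;
    const chunk = bytes.subarray(cursor, cursorEnd);

    // Completed lines end at the last "\n" byte, or at end of file in the last chunk.
    const usable = isLastChunk ? chunk.length : chunk.lastIndexOf(0x0a) + 1;
    if (usable === 0) {
      errors.push({ line: lineNumber + 1, message: 'line is longer than one chunk' });
      break;
    }

    let start = 0;
    while (start < usable && errors.length < maxErrors) {
      let end = chunk.indexOf(0x0a, start);
      if (end === -1 || end >= usable) end = usable;
      lineNumber++;
      const lineBytes = chunk.subarray(start, end);
      start = end + 1;
      if (lineBytes.length === 0) continue; // assumption: skip blank lines
      if (lineBytes.length > maxLineSize) {
        errors.push({ line: lineNumber, message: 'line exceeds the size limit' });
        continue;
      }
      let text;
      try {
        text = decoder.decode(lineBytes);
      } catch {
        errors.push({ line: lineNumber, message: 'invalid UTF-8' });
        continue;
      }
      const message = checkLine(text, lineNumber);
      if (message) errors.push({ line: lineNumber, message });
    }
    cursor += usable;
  }
  return { ok: errors.length === 0, errors, aborted: errors.length >= maxErrors };
}

// Usage with a tiny chunk size so that the input spans several chunks.
const sample = new TextEncoder().encode('{"class":1}\nnot json\n{"class":3}\n');
const result = validateJsonLines(
  sample,
  (text) => {
    try { JSON.parse(text); return null; } catch (err) { return 'invalid JSON'; }
  },
  { chunkSize: 16 }
);
console.log(result.ok, result.errors.length, result.errors[0].line); // false 1 2
```

When ok is true, the caller can move on to the next action. Otherwise the collected errors carry the line numbers to show on the UI.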