Validation after uploading a JSON Lines file

3 min readSep 16, 2020

JSON Lines is a convenient format for storing structured data that may be processed one record at a time. It essentially consists of several lines where each individual line is a valid JSON object, separated by newline character \n.

Example of JSON Lines format:

[
   {"class":1,"teacher":"Mark","student":["Tom"]},
   {"class":2,"teacher":"John","student":["Jessi","Antony","Jack"]},
   {"class":3,"teacher":"Bob","student":["Jerry","Karol"]},
]

Every entry in JSON Lines is a valid JSON. It can be parsed/unmarshaled as a standalone JSON document.

Advantage of JSON Lines:

No need to read the entire file in memory before parse.
Easily add further lines to the file by simply appending to the file.

3 requirements of JSON Lines:

UTF-8 Encoding
Each Line is a Valid JSON Value
Line Separator is \n

What kinds of validation is required after uploading a JSON Lines file?

Check if the uploaded file with file extension .jsonl
Check if each line is not exceeded 2MB (my project requirement)
Check if each line of data is obeyed with JSON schema provided (my project requirement)
Check if UTF-8 Encoding supported
Check if support any length of file (my project requirement)

Progress of validation

Read the uploaded file

↓

Define parameters
- cursor = 0 to represent the beginning of each chunk
- max_line_size = 2*2**20to represent the max size of each line (1 MB = 2^20 bytes in binary)
- chunk_size = 3*2**20 to represent the size of data be loaded each time
- cursor_end = cursor + chunk_size to represent the end of each chunk
- is_last_chunk = cursor_end > file.end to represent if reaching end of file
- chunk = file.slice(cursor, cursor_end) to represent one chunk of file data

↓

If not last chunk (meaning last character may not be an intact UTF8 character), remove the last byte of the file data

UTF-8 is defined to encode code points in one to four bytes. The x characters are replaced by the bits of the code point.

↓

Split file data in lines array by \n
lines = file_text.split('\n')

If last line is without \n , it’s not a completed line
completed_line = is_last_chunk ? lines : lines.slice(0, -1)

↓

Read first line of first chunk

↓

Validate if each line is followed the rule of JSON schema provided
Take the above Example of JSON Lines format as example: each line must contain the key of “class”, “teacher” or “student”

Use Ajv as JSON Schema Validator

Show the error on UI of which line is violated the rule and what the error is

↓

While validating, collect these validation errors

If these validation errors are exceeded the max_n_json_error (let’s say 500), abort the reading. This is essential when reading very large of file.

↓

If no validation error this line, read next line

↓

If no validation error for these lines in this chunk, read next chunk

↓

If every chunk of this file is done with validation and no errors, execute next action

Summary of the progress

file uploaded → start validate → read each chunk → validate each line → validation done → next action

References:

JSON Lines

This page describes the JSON Lines text format, also called newline-delimited JSON. JSON Lines is a convenient format…

jsonlines.org

How to Love jsonl — using JSON Line Format in your Workflow

Why you should think about using JSON Line format in your data processing workflow. We’ll look at some jsonl examples…

medium.com

jsonlines - jsonlines documentation

is a Python library to simplify working with jsonlines and ndjson data. This data format is straight-forward: it is…

jsonlines.readthedocs.io

jsonlines - jsonlines documentation

is a Python library to simplify working with jsonlines and ndjson data. This data format is straight-forward: it is…

jsonlines.readthedocs.io

UTF-8

UTF-8 is a variable-width character encoding used for electronic communication. Defined by the Unicode Standard, the…

en.wikipedia.org

ajv-validator/ajv

The fastest JSON Schema validator for Node.js and browser. Supports draft-04/06/07. Ajv has been awarded a grant from…

github.com

FileReader

The FileReader object lets web applications asynchronously read the contents of files (or raw data buffers) stored on…

developer.mozilla.org

Validation after uploading a JSON Lines file

Advantage of JSON Lines:

3 requirements of JSON Lines:

What kinds of validation is required after uploading a JSON Lines file?

Progress of validation

Summary of the progress

References:

JSON Lines

This page describes the JSON Lines text format, also called newline-delimited JSON. JSON Lines is a convenient format…

How to Love jsonl — using JSON Line Format in your Workflow

Why you should think about using JSON Line format in your data processing workflow. We’ll look at some jsonl examples…

jsonlines - jsonlines documentation

is a Python library to simplify working with jsonlines and ndjson data. This data format is straight-forward: it is…

jsonlines - jsonlines documentation

is a Python library to simplify working with jsonlines and ndjson data. This data format is straight-forward: it is…

UTF-8

UTF-8 is a variable-width character encoding used for electronic communication. Defined by the Unicode Standard, the…

ajv-validator/ajv

The fastest JSON Schema validator for Node.js and browser. Supports draft-04/06/07. Ajv has been awarded a grant from…

FileReader

The FileReader object lets web applications asynchronously read the contents of files (or raw data buffers) stored on…

Written by CH-K