Validation after uploading a JSON Lines file

CH-K
Sep 16, 2020

JSON Lines is a convenient format for storing structured data that can be processed one record at a time. A file consists of lines, each of which is a valid JSON value (typically an object), separated by the newline character \n.

Example of JSON Lines format:

{"class":1,"teacher":"Mark","student":["Tom"]}
{"class":2,"teacher":"John","student":["Jessi","Antony","Jack"]}
{"class":3,"teacher":"Bob","student":["Jerry","Karol"]}

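Every line in a JSON Lines file is valid JSON on its own and can be parsed/unmarshaled as a standalone JSON document. Unlike a regular JSON array, there are no enclosing brackets and no commas between records.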
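As a rough illustration of parsing each line as a standalone document, here is a minimal sketch. The function name parseJsonLines and the decision to skip blank lines are assumptions made for this example, not part of the format description above.

```javascript
// Parse JSON Lines text: every non-empty line is parsed on its own.
function parseJsonLines(text) {
  const records = [];
  const errors = [];
  text.split("\n").forEach((line, index) => {
    if (line.trim() === "") return; // assumption: tolerate a trailing newline or blank line
    try {
      records.push(JSON.parse(line));
    } catch (err) {
      // Keep the 1-based line number so the error can be reported later.
      errors.push({ line: index + 1, message: err.message });
    }
  });
  return { records, errors };
}

const sample =
  '{"class":1,"teacher":"Mark","student":["Tom"]}\n' +
  '{"class":2,"teacher":"John","student":["Jessi","Antony","Jack"]}\n' +
  '{"class":3,"teacher":"Bob","student":["Jerry","Karol"]}\n';

const parsed = parseJsonLines(sample);
console.log(parsed.records.length); // 3
console.log(parsed.records[1].teacher); // John
```

A broken line only produces an entry in errors; the other lines are still parsed, which is what makes the format convenient for record-by-record processing.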

Advantages of JSON Lines:

  1. There is no need to read the entire file into memory before parsing it.
  2. New records can be added simply by appending lines to the file.

Three requirements of JSON Lines:

  1. UTF-8 encoding
  2. Each line is a valid JSON value
  3. The line separator is \n

What kinds of validation are required after uploading a JSON Lines file?

  1. Check that the uploaded file has the .jsonl file extension
  2. Check that no line exceeds 2 MB (my project requirement)
  3. Check that each line conforms to the provided JSON schema (my project requirement)
  4. Check that the file is UTF-8 encoded
  5. Check that files of any length are supported (my project requirement)
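The first two checks can be sketched as small helper functions. The names below are hypothetical, and treating the extension case-insensitively is an assumption. The line size is measured in UTF-8 bytes rather than in characters, since the limit is stated in MB.

```javascript
const MAX_LINE_SIZE = 2 * 2 ** 20; // 2 MB per line, in bytes

// Check 1: the file name ends with .jsonl (assumption: case-insensitive).
function hasJsonlExtension(fileName) {
  return fileName.toLowerCase().endsWith(".jsonl");
}

// Size of a line in UTF-8 bytes, not in JavaScript string characters.
function lineByteLength(line) {
  return new TextEncoder().encode(line).length;
}

// Check 2: the line does not exceed the byte limit.
function isLineWithinLimit(line, maxBytes = MAX_LINE_SIZE) {
  return lineByteLength(line) <= maxBytes;
}

console.log(hasJsonlExtension("classes.jsonl")); // true
console.log(lineByteLength("é")); // 2
```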

Validation process

Read the uploaded file

Define parameters
- cursor = 0 to represent the beginning of each chunk
- max_line_size = 2*2**20 to represent the maximum size of each line in bytes (2 MB, since 1 MB = 2^20 bytes in binary)
- chunk_size = 3*2**20 to represent the amount of data loaded each time (3 MB)
- cursor_end = cursor + chunk_size to represent the end of each chunk
- is_last_chunk = cursor_end >= file.size to represent whether the chunk reaches the end of the file
- chunk = file.slice(cursor, cursor_end) to represent one chunk of file data
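The parameters above could be combined into a small helper along these lines. This is a sketch, not the project's actual code: chunkWindow is a hypothetical name, and a Uint8Array stands in for the uploaded file so the example runs outside a browser.

```javascript
const CHUNK_SIZE = 3 * 2 ** 20; // 3 MB loaded per read

// Compute the byte window of the chunk that starts at `cursor`.
function chunkWindow(cursor, fileSize, chunkSize = CHUNK_SIZE) {
  const cursorEnd = Math.min(cursor + chunkSize, fileSize);
  const isLastChunk = cursorEnd >= fileSize;
  return { cursor, cursorEnd, isLastChunk };
}

// In a browser the bytes would come from file.slice(cursor, cursorEnd);
// here a 10-byte array plays the role of the file, read 4 bytes at a time.
const fakeFile = new Uint8Array(10);
const first = chunkWindow(0, fakeFile.length, 4);
const last = chunkWindow(8, fakeFile.length, 4);
const chunk = fakeFile.subarray(first.cursor, first.cursorEnd);

console.log(first.isLastChunk, chunk.length); // false 4
console.log(last.cursorEnd, last.isLastChunk); // 10 true
```

Because only one window is held in memory at a time, the same loop works for a file of any length.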

If this is not the last chunk, the chunk may end in the middle of a multi-byte UTF-8 character, so remove the trailing byte(s) of that incomplete character before decoding the file data.

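UTF-8 encodes each code point in one to four bytes. In the byte patterns below, the x characters are replaced by the bits of the code point.

The structure of the encoding:

1 byte:  0xxxxxxx
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx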
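One way to find where the last complete character ends is to look at the final few bytes of the chunk: continuation bytes start with the bits 10, and the lead byte says how long the character should be. The function below is a sketch under that assumption; completeUtf8Length is a hypothetical name, and it does not try to detect invalid UTF-8 elsewhere in the chunk.

```javascript
// Return how many leading bytes of `bytes` contain only complete characters.
function completeUtf8Length(bytes) {
  const n = bytes.length;
  // The lead byte of the final character is at most 4 bytes from the end.
  for (let back = 1; back <= 4 && back <= n; back++) {
    const b = bytes[n - back];
    if ((b & 0b11000000) === 0b10000000) continue; // 10xxxxxx: continuation byte
    let expected = 1; // 0xxxxxxx
    if ((b & 0b11100000) === 0b11000000) expected = 2; // 110xxxxx
    else if ((b & 0b11110000) === 0b11100000) expected = 3; // 1110xxxx
    else if ((b & 0b11111000) === 0b11110000) expected = 4; // 11110xxx
    // If the character needs more bytes than the chunk holds, cut before it.
    return expected > back ? n - back : n;
  }
  return n; // no lead byte near the end: leave the problem to the decoder
}

// "a€" is 0x61 followed by the 3-byte sequence E2 82 AC.
const whole = new Uint8Array([0x61, 0xe2, 0x82, 0xac]);
const cut = whole.subarray(0, 3); // chunk boundary falls inside "€"

console.log(completeUtf8Length(whole)); // 4
console.log(completeUtf8Length(cut)); // 1
console.log(new TextDecoder("utf-8").decode(cut.subarray(0, completeUtf8Length(cut)))); // a
```

Constructing the decoder with the option { fatal: true } makes decode throw on invalid byte sequences instead of substituting replacement characters, which is one possible way to implement the UTF-8 check.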

Split the decoded file data into an array of lines on \n
lines = file_text.split('\n')

If the last line of a chunk does not end with \n, it is not a complete line, so leave it for the next chunk unless this is the last chunk
completed_line = is_last_chunk ? lines : lines.slice(0, -1)
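The split step can be sketched as follows. The article does not say how the cursor advances, so consumedBytes (the number of bytes covered by the completed lines) is an assumption of this example, as is the name splitCompleteLines.

```javascript
// Split decoded chunk text into complete lines and report how many bytes
// they cover, so the next chunk can start right after the last "\n".
function splitCompleteLines(text, isLastChunk) {
  const lines = text.split("\n");
  const completed = isLastChunk ? lines : lines.slice(0, -1);
  // A file ending in "\n" leaves one empty trailing entry; drop it.
  if (isLastChunk && completed[completed.length - 1] === "") completed.pop();
  const consumedText = isLastChunk
    ? text
    : text.slice(0, text.lastIndexOf("\n") + 1);
  const consumedBytes = new TextEncoder().encode(consumedText).length;
  return { completed, consumedBytes };
}

const part = splitCompleteLines('{"class":1}\n{"class":2}\n{"cla', false);
console.log(part.completed.length, part.consumedBytes); // 2 24
```

With a chunk size of 3 MB and a line limit of 2 MB, a chunk other than the last that yields no completed line at all would indicate a line over the limit.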

Read the first line of the first chunk

Validate that each line follows the rules of the provided JSON schema
Taking the JSON Lines example above: each line must contain the keys “class”, “teacher”, and “student”

Use Ajv as the JSON Schema validator
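For illustration, a schema for the example records might look like the one below. The article does not show the real schema, so the types and the required list are assumptions. The Ajv class is passed in as a parameter so that the sketch does not depend on how the library is loaded (for example via const Ajv = require("ajv")).

```javascript
// Hypothetical schema for the example records (types are assumed).
const classSchema = {
  type: "object",
  properties: {
    class: { type: "integer" },
    teacher: { type: "string" },
    student: { type: "array", items: { type: "string" } },
  },
  required: ["class", "teacher", "student"],
};

// Compile the schema once and reuse the resulting function for every line.
function compileLineValidator(Ajv, schema = classSchema) {
  const ajv = new Ajv({ allErrors: true }); // report every violation, not only the first
  // compile() returns validate(data) -> boolean; after a failed call the
  // details are available on validate.errors.
  return ajv.compile(schema);
}
```

Compiling once up front matters here because the same validator is then called for every line of every chunk.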

Show in the UI which line violates the rules and what the error is

While validating, collect the validation errors

If the number of validation errors exceeds max_n_json_error (let’s say 500), abort the reading. This is essential when reading very large files.

If the current line has no validation error, read the next line

If none of the lines in the current chunk has a validation error, read the next chunk

If every chunk of the file has been validated without errors, execute the next action
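The per-chunk loop described above could look roughly like this. It is a sketch with hypothetical names: validate is any function shaped like an Ajv-compiled validator (it returns a boolean and leaves details on validate.errors), and a tiny stand-in is used here so the example has no dependencies.

```javascript
const MAX_N_JSON_ERROR = 500;

// Validate the completed lines of one chunk. Errors are appended to the
// shared `errors` array; returns false when reading should be aborted.
function validateLines(lines, validate, firstLineNumber, errors, maxErrors = MAX_N_JSON_ERROR) {
  for (let i = 0; i < lines.length; i++) {
    const lineNumber = firstLineNumber + i;
    let record;
    let parsedOk = true;
    try {
      record = JSON.parse(lines[i]);
    } catch (err) {
      parsedOk = false;
      errors.push({ line: lineNumber, message: "invalid JSON: " + err.message });
    }
    if (parsedOk && !validate(record)) {
      errors.push({ line: lineNumber, message: JSON.stringify(validate.errors) });
    }
    if (errors.length > maxErrors) return false; // too many errors: abort
  }
  return true; // the caller may read the next chunk
}

// Stand-in validator that only checks for a "teacher" key.
const hasTeacher = (record) => {
  const ok = typeof record === "object" && record !== null && "teacher" in record;
  hasTeacher.errors = ok ? null : [{ message: "must have required property 'teacher'" }];
  return ok;
};

const collected = [];
const keepGoing = validateLines(
  ['{"teacher":"Mark"}', '{"class":2}', "not json"],
  hasTeacher,
  1,
  collected
);
console.log(keepGoing, collected.map((e) => e.line)); // true [ 2, 3 ]
```

Passing the line number of the chunk's first line keeps the reported line numbers correct across chunks, which is what the UI needs to point at the offending line.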

Summary of the process

file uploaded → start validation → read each chunk → validate each line → validation done → next action
