Visualization and Debugging
We have introduced a new utility called TokenFlow, built with Flask, HTML/CSS, and JavaScript. This tool visualizes preprocessed data in an organized fashion, making it easier to debug the output data and catch errors. It handles both text-only and multimodal datasets, and displays images when available so that dataset integrity can be verified.
How To Use?
python3 launch_tokenflow.py --output_dir <directory/of/file(s)>
Arguments
output_dir: Contains the file(s) that are to be viewed in the GUI. [Required]
data_params: Location of the data_params.json file for the preprocessed dataset. [Optional]
port: In case the user wants to specify a different port for the Flask server. [Optional, default=5000]
It is assumed that data_params.json is present in the same directory as output_dir. If it is not, pass its location using --data_params </location/of/data_params.json>.
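The argument handling described above can be sketched as follows. This is a minimal illustration, not the actual contents of launch_tokenflow.py: the function names parse_args and resolve_data_params are hypothetical, and the sketch assumes that "the same directory as output_dir" means data_params.json sits inside output_dir.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Hypothetical parser mirroring the three documented arguments."""
    parser = argparse.ArgumentParser(description="Sketch of TokenFlow launcher arguments")
    parser.add_argument("--output_dir", required=True,
                        help="Directory containing the file(s) to view in the GUI")
    parser.add_argument("--data_params", default=None,
                        help="Location of data_params.json (optional)")
    parser.add_argument("--port", type=int, default=5000,
                        help="Port for the Flask server")
    return parser.parse_args(argv)


def resolve_data_params(output_dir, data_params=None):
    """Use the explicit path when given; otherwise fall back to output_dir.

    Assumption: the default location is <output_dir>/data_params.json.
    """
    if data_params is not None:
        return Path(data_params)
    return Path(output_dir) / "data_params.json"
```

With this sketch, omitting --data_params resolves to a file inside output_dir, while passing it overrides that default.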
All the .h5 files found in output_dir are listed in the first dropdown, and a file is loaded when it is clicked. For each loaded file, all the available sequences are listed in the second dropdown, and a sequence is likewise loaded when it is clicked.
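As a rough illustration of how the first dropdown could be populated, the sketch below collects the .h5 file names from a directory. The helper name list_h5_files is hypothetical, and the assumptions that only the top level of output_dir is scanned and that names are sorted are not confirmed by the tool.

```python
from pathlib import Path


def list_h5_files(output_dir):
    """Return the sorted names of .h5 files directly inside output_dir.

    Assumption: subdirectories are not searched.
    """
    return sorted(p.name for p in Path(output_dir).glob("*.h5"))
```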
Visualized Information
Given the dataset, TokenFlow displays the associated metadata, taken from data_params.json, in the left column of the page. For each .h5 file, TokenFlow also calculates the top 5 tokens that occur most frequently across all sequences in that file.
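The per-file frequency count can be pictured with a short sketch. It assumes each sequence is available as a list of token ids and uses a hypothetical helper named top_k_tokens. How TokenFlow actually reads the sequences, and whether it counts ids or decoded strings, is not stated here.

```python
from collections import Counter


def top_k_tokens(sequences, k=5):
    """Count token occurrences across all sequences and return the k most common.

    `sequences` is assumed to be an iterable of lists of token ids.
    """
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    return counts.most_common(k)
```

For example, calling top_k_tokens(sequences) on the sequences of one file would return up to five (token, count) pairs, ordered from most to least frequent.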
The right column contains 4 sections. input_strings and label_strings are the tokens converted from input_ids and labels, respectively. In the string sections, a token is highlighted in green when its Loss weight is greater than zero. Similarly, a token is highlighted in red when its Attention mask is set to zero.
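The highlighting rules can be expressed as a small per-token check. The sketch below is only an illustration under stated assumptions: highlight_tags is a hypothetical name, loss weights and attention mask values are assumed to be parallel per-token lists, and because the text does not say which color wins when both conditions hold, the sketch simply reports both tags.

```python
def highlight_tags(loss_weights, attention_mask):
    """Return, for each token, the list of highlight colors that apply.

    green: loss weight greater than zero
    red:   attention mask equal to zero
    """
    tags = []
    for weight, mask in zip(loss_weights, attention_mask):
        token_tags = []
        if weight > 0:
            token_tags.append("green")
        if mask == 0:
            token_tags.append("red")
        tags.append(token_tags)
    return tags
```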
Hovering over a token in any of the 4 sections highlights the corresponding token in all the other sections. This helps in checking whether the mapping from an id to its token, or from an input to its label, is correct.
Additionally, hovering opens a popup that displays further information, such as the Position IDs and the token’s idx in the MSL (maximum sequence length). In the case of a multimodal dataset, hovering over an image pad token also displays the corresponding image in the popup.