tokenizers.pre_tokenizers.Whitespace
#huggingface/tokenizers
https://huggingface.co/docs/tokenizers/api/pre-tokenizers#tokenizers.pre_tokenizers.Whitespace
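This pre-tokenizer splits the input using the following regex. Each resulting piece is either a run of word characters (`\w+`) or a run of characters that are neither word characters nor whitespace (`[^\w\s]+`, typically punctuation); whitespace itself is dropped: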
\w+|[^\w\s]+
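
To make the effect of the pattern concrete, here is a minimal sketch that applies the same regex with Python's standard `re` module. It is not the library's implementation: the helper name `whitespace_pre_tokenize` is hypothetical, and Python's definitions of `\w` and `\s` may differ from the library's regex engine on some Unicode input.

```python
import re

# The pattern quoted above: runs of word characters, or runs of
# characters that are neither word characters nor whitespace.
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def whitespace_pre_tokenize(text):
    """Hypothetical helper: return (piece, (start, end)) for each match.

    Whitespace never matches either alternative, so it is skipped.
    """
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]


for piece, offsets in whitespace_pre_tokenize("Hello, world! It's 3.5"):
    print(piece, offsets)
```

Note that punctuation is separated from adjacent words, so "It's" becomes three pieces and "3.5" becomes "3", ".", "5". Because `\w` includes the underscore, an identifier such as "a_b" stays in one piece. With the library installed, `Whitespace().pre_tokenize_str(text)` should return pieces and offsets in a comparable form.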