NLP Demystified 2: Text Tokenization

Course playlist: https://www.youtube.com/playlist?list=PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS

The usual first step in NLP is tokenization: breaking documents into smaller pieces called tokens. We will look at the challenges involved and at how to tokenize text in practice, with a demo in spaCy.
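To give a feel for why tokenization is harder than it sounds, here is a minimal sketch using only Python's standard library. It is not the approach from the video or the Colab notebook (which use spaCy); the sample sentence, the `simple_tokenize` helper, and the regular expression are all assumptions made up for illustration.

```python
import re

# Hypothetical sample sentence, chosen to include a contraction,
# punctuation attached to words, and a currency amount.
text = "Don't panic, it's only $5!"

# Naive approach: split on whitespace.
# Punctuation stays glued to neighbouring words ("panic,", "$5!").
whitespace_tokens = text.split()

# Illustrative pattern (an assumption, not a library rule set):
#   \w+(?:'\w+)?  a word, optionally followed by an apostrophe suffix
#   [^\w\s]       any single character that is neither word nor whitespace
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def simple_tokenize(s: str) -> list[str]:
    """Split text into word and punctuation tokens with one regex."""
    return TOKEN_PATTERN.findall(s)


regex_tokens = simple_tokenize(text)

print(whitespace_tokens)  # ["Don't", 'panic,', "it's", 'only', '$5!']
print(regex_tokens)       # ["Don't", 'panic', ',', "it's", 'only', '$', '5', '!']
```

Even this small pattern has to take a position on contractions and symbols, and it would still mishandle abbreviations, URLs, and hyphenated words. Library tokenizers such as spaCy handle these cases with language-specific rules and exceptions that this sketch does not attempt.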

Colab notebook: https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_preprocessing.ipynb

Timestamps:
00:00 Tokenization
00:12 Text as unstructured data
00:39 What is tokenization?
01:09 The challenges of tokenization
03:09 DEMO: tokenizing text with spaCy
07:55 Preprocessing as a pipeline

This video is part of Natural Language Processing Demystified – a free, accessible course on NLP.

Visit https://www.nlpdemystified.org/ for more information.
