lecture 08
NLP101
+
Embedding
(word2vec as an example)
+
a lil workshop
(language as an interface with generative AI)
SeAts APp SEAtS ApP SEaTS APP
🧘
AI artist of this week: portrait xo
after today's lecture:
-- what is (not) NLP
-- solving NLP basic tasks with Apple Natural Language Framework
-- intuition about embedding
-- using our language to interact with generative AI
NLP:
Natural Language Processing
-- recall data modality?
data modality:
image, text, audio, sensor data, etc.
for example,
object detection is about image data
NLP?
-- look it up on wikipedia
-- there is no hard definition but it is (almost) everything about human language
-- it is an interdisciplinary subfield
Example applications of NLP:
text-to-speech 🗣️
speech-to-text 👂
machine translation 🧠
image captioning 🧑‍🏫
text-to-image generation 🧑‍🎨
etc.
also NLP:
-- Lemmatization
-- Named Entity Recognition
-- Part-of-speech tagging
etc.
❓❓❓
engaging with language is very natural to us
(it is a given; we can use it without fully understanding how our language system works),
yet it is very complex and comprises many low-level subtasks, including:
-- Language identification
-- Lemmatization
-- Named Entity Recognition
-- Part-of-speech tagging
-- Tokenization
etc.
For today's lecture, we will go through each of these basic tasks with reference to

🍎Apple Natural Language Framework🍎 solutions
There is no need to understand the model, just to know how to use it 👍
open an Xcode playground and import the frameworks:
import NaturalLanguage 
import Foundation 
import CoreML
Language identification
--1. what is it about? 🥷
--- try to answer by filling in the blanks: Given an input of __, the solution model should produce an output of __
--2. what are the possible use cases? 🧑‍🍳
--3. paste and run the example code below! 🕹️
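here is a minimal sketch of what the language identification code can look like, using NLLanguageRecognizer from the framework (the sample sentence is just a placeholder):

import NaturalLanguage

// guess the dominant language of a piece of text
let recognizer = NLLanguageRecognizer()
recognizer.processString("La vie est belle quand on apprend le NLP.")

if let language = recognizer.dominantLanguage {
    print("dominant language: \(language.rawValue)") // e.g. "fr"
}

// the recognizer can also return several hypotheses with confidences
for (language, confidence) in recognizer.languageHypotheses(withMaximum: 3) {
    print("\(language.rawValue): \(confidence)")
}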
Named Entity Recognition
--1. what is it about? 🥷
--- try to answer by filling in the blanks: Given an input of __, the solution model should produce an output of __
--2. what are the possible use cases? 🧑‍🍳
--3. paste and run the example code below! 🕹️
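a sketch of the NER code using NLTagger with the .nameType scheme (the sentence and the list of tags are just illustrative):

import NaturalLanguage

let text = "Ada Lovelace worked with Charles Babbage in London."

let tagger = NLTagger(tagSchemes: [.nameType])
tagger.string = text

let options: NLTagger.Options = [.omitWhitespace, .omitPunctuation, .joinNames]
let entityTags: [NLTag] = [.personalName, .placeName, .organizationName]

// walk through the text word by word and print any named entities
tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                     unit: .word,
                     scheme: .nameType,
                     options: options) { tag, range in
    if let tag = tag, entityTags.contains(tag) {
        print("\(text[range]) -> \(tag.rawValue)")
    }
    return true
}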
Part-of-speech tagging
--1. what is it about? 🥷
--- try to answer by filling in the blanks: Given an input of __, the solution model should produce an output of __
--2. what are the possible use cases? 🧑‍🍳
--3. paste and run the example code below! 🕹️
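a sketch of the part-of-speech code, again with NLTagger but using the .lexicalClass scheme:

import NaturalLanguage

let text = "The quick brown fox jumps over the lazy dog."

let tagger = NLTagger(tagSchemes: [.lexicalClass])
tagger.string = text

// print the part of speech (noun, verb, adjective, ...) of every word
tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                     unit: .word,
                     scheme: .lexicalClass,
                     options: [.omitWhitespace, .omitPunctuation]) { tag, range in
    if let tag = tag {
        print("\(text[range]) -> \(tag.rawValue)")
    }
    return true
}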
Tokenization
--1. what is it about? 🥷
--- try to answer by filling in the blanks: Given an input of __, the solution model should produce an output of __
--2. what are the possible use cases? 🧑‍🍳
--3. paste and run the example code below! 🕹️
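a sketch of the tokenization code with NLTokenizer (swap the unit to .sentence or .paragraph for coarser tokens):

import NaturalLanguage

let text = "Tokenization splits text into units such as words or sentences."

let tokenizer = NLTokenizer(unit: .word)
tokenizer.string = text

// print every word-level token
tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
    print(text[range])
    return true
}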
a very recent text-to-audio generation model with a playable demo: AudioLDM
AI intuitions 02
embedding
recap: one-hot encoding

-- how to encode today being Thursday (day-of-the-week)?
-- and what is the size of the vector?
new info: distance between two vectors

-- Euclidean
-- Cosine
etc.
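here is a small plain-Swift sketch of both ideas: the 7-slot one-hot vector for "today is Thursday", plus the two distance measures (the helper functions euclideanDistance and cosineDistance are just names made up for this sketch):

import Foundation

// one-hot vector for "today is Thursday": 7 slots, one per weekday
// order: [Mon, Tue, Wed, Thu, Fri, Sat, Sun]
let thursday: [Double] = [0, 0, 0, 1, 0, 0, 0]
let friday:   [Double] = [0, 0, 0, 0, 1, 0, 0]

// Euclidean distance: square root of the sum of squared differences
func euclideanDistance(_ a: [Double], _ b: [Double]) -> Double {
    var sum = 0.0
    for i in 0..<a.count {
        let d = a[i] - b[i]
        sum += d * d
    }
    return sum.squareRoot()
}

// cosine distance: 1 - (a . b) / (|a| * |b|)
func cosineDistance(_ a: [Double], _ b: [Double]) -> Double {
    var dot = 0.0, normA = 0.0, normB = 0.0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return 1 - dot / (normA.squareRoot() * normB.squareRoot())
}

print(euclideanDistance(thursday, friday)) // sqrt(2), about 1.414
print(cosineDistance(thursday, friday))    // 1.0: the two vectors are orthogonal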
Quote by Douglas Adams
β€œI've come up with a set of rules that describe our reactions to technologies:
1. Anything that is in the world when you’re born is normal and ordinary and is just a natural part of the way the world works.
2. Anything that's invented between when you’re fifteen and thirty-five is new and exciting and revolutionary and you can probably get a career in it.
3. Anything invented after you're thirty-five is against the natural order of things.”
how to encode (numberify) the first sentence?
Anything that is in the world when you’re born is normal and ordinary and is just a natural part of the way the world works.
here is one possible scheme for encoding:
step 1
tokenization
step 2
one-hot encoding:
assign a vector to each unique word (example on the board)
what is the size of each one-hot vector?
-- recall: what is the size of the one-hot vector of today being Thursday?
some issues of such a numberification scheme:
1. vector size is large
(more memory required and higher computation costs)
some issues of such a numberification scheme:
2. [IMPORTANT] relational information between words is lost:
-- the distance between different word pairs is always the same
-- but we know that in semantics, some words are similar/closer
e.g. "normal" is closer to "ordinary" than it is to "born"
me on the whiteboard:
calculate the distance between the one-hot vectors of
-- "normal" and "ordinary"
-- "normal" and "born"
same distances!
-- our semantic "relational information" is not reflected in the one-hot encoding scheme!
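roughly what this whiteboard calculation looks like in code; a toy 3-word vocabulary is enough to see the problem (real one-hot vectors would be as long as the full vocabulary), and the distance helper is the same one sketched earlier:

// toy vocabulary: "normal" -> slot 0, "ordinary" -> slot 1, "born" -> slot 2
let normal:   [Double] = [1, 0, 0]
let ordinary: [Double] = [0, 1, 0]
let born:     [Double] = [0, 0, 1]

func euclideanDistance(_ a: [Double], _ b: [Double]) -> Double {
    var sum = 0.0
    for i in 0..<a.count {
        let d = a[i] - b[i]
        sum += d * d
    }
    return sum.squareRoot()
}

// both pairs are exactly the same distance apart:
// one-hot vectors say nothing about which words are semantically related
print(euclideanDistance(normal, ordinary)) // 1.414...
print(euclideanDistance(normal, born))     // 1.414...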
word2vec (an AI model type) to the rescue
it has an ingenious training target:
-- given a word, predict its surrounding words (the context)
-- given surrounding words (the context), predict the centre word
model details in the future intuition series... stay tuned!
advantages:

-- smaller embedding vector size (pre-defined before training)
-- relations are preserved (see the sketch below)
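you can get a feel for both advantages right inside the playground: the NaturalLanguage framework ships a pre-trained English word embedding (NLEmbedding). it is not literally word2vec, but it behaves in the same spirit, so treat this as an illustrative sketch:

import NaturalLanguage

if let embedding = NLEmbedding.wordEmbedding(for: .english) {
    // advantage 1: a small, fixed vector size instead of a vocabulary-sized one-hot
    print("embedding dimension: \(embedding.dimension)")

    // advantage 2: semantic relations show up as distances
    // (the first distance should come out smaller than the second)
    print(embedding.distance(between: "normal", and: "ordinary"))
    print(embedding.distance(between: "normal", and: "born"))

    // nearest neighbours of a word in the embedding space
    print(embedding.neighbors(for: "music", maximumCount: 5))
}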
an implication:
compression brings abstraction
because the model has to discover and use "relations" to save memory space
(and abstraction seems to be crucial for intelligence)
Play around with a pre-trained word2vec model here
making your own word2vec equation:
-- make a hypothesis on analogous words
-- try verifying it (using the notebook, or the sketch below)
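the notebook is the intended tool, but if you want to stay in Swift, roughly the same "word equation" experiment can be sketched with NLEmbedding's vectors (results will differ from a real word2vec model):

import NaturalLanguage

// the classic "king - man + woman ≈ queen" style hypothesis
if let embedding = NLEmbedding.wordEmbedding(for: .english),
   let king = embedding.vector(for: "king"),
   let man = embedding.vector(for: "man"),
   let woman = embedding.vector(for: "woman") {

    // build the hypothesis vector element by element: king - man + woman
    var hypothesis = [Double](repeating: 0, count: king.count)
    for i in 0..<king.count {
        hypothesis[i] = king[i] - man[i] + woman[i]
    }

    // which words sit closest to the hypothesis vector?
    print(embedding.neighbors(for: hypothesis, maximumCount: 5))
}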
well done everyone 🎉
we have gone through MSc-level content AGAIN
lil workshop
- Language as mediator:
-- interact with text-to-image and text-to-audio generative AIs using the same/similar text prompts
text-to-image generative AIs
- SD
- this list of treasures
text-to-audio generative AIs
- AudioLDM
- bark
lil workshop
Language as mediator:
-- use the same/similar prompt to generate an image and a piece of audio
-- select any text-to-X model
-- have funnn 🤪
-- optional: combine the linked image and audio using iMovie
today we talked about:

-- introduction to NLP 🎃
-- some basic NLP tasks solved with Apple NL framework
--- Language identification
--- Lemmatization
--- Named Entity Recognition
--- Part-of-speech tagging
--- Tokenization
today we talked about:

-- intuition about embedding (how to numberify words) 🧚
--- smaller embedding vector size
--- preserved relational information
--- word2vec as an example
today we talked about:

-- using language to connect text-to-X models of different modalities 🌉