NLP Study Notes

Andrew Ng, "Natural Language Processing" — Course 1

Week 1

flowchart LR
Preprocess --> b[Building and Visualizing word frequencies] --> c[Run Logistic Regression Model]

Preprocess

Use the Natural Language Toolkit (NLTK) package, an open-source Python library for natural language processing. Use its Twitter dataset, which contains exactly 5000 positive tweets and 5000 negative tweets and has been manually annotated.

import nltk                               # Python library for NLP
from nltk.corpus import twitter_samples   # sample Twitter dataset from NLTK
import matplotlib.pyplot as plt           # library for visualization
import random                             # pseudo-random number generator

# download the sample twitter dataset
nltk.download('twitter_samples')

Load dataset:

# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

nltk.corpus.twitter_samples is a dedicated corpus loader: it inherits the generic methods of NLTK's corpus readers and adds a few methods specific to its own data structure.

The most commonly used methods of the twitter_samples object are:

  1. fileids():

    • Purpose: returns a list of the IDs (file names) of all files available in the twitter_samples corpus.

    • Example:

      Python

      from nltk.corpus import twitter_samples
      print(twitter_samples.fileids())
      # possible output: ['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']
  2. strings(fileid):

    • Purpose: extracts the plain text of every tweet in the given fileid and returns it as a list of strings. This is one of the most frequently used methods, since it directly provides the text for NLP processing.

    • Parameter: fileid (string) - the ID of the file to read, e.g. 'positive_tweets.json'.

    • Example:

      Python

      from nltk.corpus import twitter_samples
      positive_tweets = twitter_samples.strings('positive_tweets.json')
      print(positive_tweets[0])
  3. tokenized(fileid):

    • Purpose: extracts the tweets from the given fileid and tokenizes each one, returning a list of lists, where each inner list holds the words (tokens) of one tweet.

    • Parameter: fileid (string) - the ID of the file to read.

    • Example:

      Python

      from nltk.corpus import twitter_samples
      tokenized_tweets = twitter_samples.tokenized('positive_tweets.json')
      print(tokenized_tweets[0])
      # possible output: ['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']
  4. docs(fileid):

    • Purpose: reads the raw JSON data from the given fileid and returns the full tweet objects as Python dictionaries, each representing the complete JSON structure of one tweet.

    • Parameter: fileid (string) - the ID of the file to read.

    • Example:

      Python

      from nltk.corpus import twitter_samples
      import json  # for pretty-printing

      raw_docs = twitter_samples.docs('positive_tweets.json')
      first_doc = raw_docs[0]
      print(json.dumps(first_doc, indent=2, ensure_ascii=False))

Summary

twitter_samples provides these four core methods for accessing its built-in Twitter data:

  • fileids(): see which data files are available.
  • strings(): get the tweets as a list of plain-text strings (most common for direct text processing).
  • tokenized(): get the tweets already tokenized (convenient for word-level analysis).
  • docs(): get the raw JSON dictionaries (useful when you need tweet metadata such as ID, user, and timestamp).

These methods make it easy to load and work with the Twitter dataset, especially for sentiment analysis and other tweet-processing tasks.

Get more familiar with your data:

print('Number of positive tweets: ', len(all_positive_tweets))
print('Number of negative tweets: ', len(all_negative_tweets))

print('\nThe type of all_positive_tweets is: ', type(all_positive_tweets))
print('The type of a tweet entry is: ', type(all_negative_tweets[0]))

The result is:

Number of positive tweets:  5000
Number of negative tweets: 5000

The type of all_positive_tweets is: <class 'list'>
The type of a tweet entry is: <class 'str'>

# print a random positive tweet in green
print('\033[92m' + all_positive_tweets[random.randint(0, len(all_positive_tweets) - 1)])

# print a random negative tweet in red
print('\033[91m' + all_negative_tweets[random.randint(0, len(all_negative_tweets) - 1)])

Result:

@charliehaarding @WeAlIlKnowA @ElliotPender @J_Mezzer smiling after he received a text from his dad saying suck him off :))
@cosplayamerica I'll be there around 10. My train was delayed. :(

Visualize:

# Declare a figure with a custom size
fig = plt.figure(figsize=(5, 5))

# labels for the two classes
labels = 'Positives', 'Negative'

# Sizes for each slice
sizes = [len(all_positive_tweets), len(all_negative_tweets)]

# Declare pie chart, where the slices will be ordered and plotted counter-clockwise:
plt.pie(sizes, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)

# Equal aspect ratio ensures that pie is drawn as a circle.
plt.axis('equal')

# Display the chart
plt.show()

Visualizing the data is always a good habit.

Data preprocessing:

For NLP, the preprocessing pipeline consists of the following tasks:

  • Tokenizing the string
  • Lowercasing
  • Removing stop words and punctuation
  • Stemming

# download the stopwords from NLTK
nltk.download('stopwords')

Stop words are words that don’t add significant meaning to the text.

import re                                  # library for regular expression operations
import string                              # for string operations

from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import TweetTokenizer   # module for tokenizing strings
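
As a quick peek (a minimal sketch, not part of the course notebook), you can inspect the stop word list and the punctuation set that the cleanup below relies on:

from nltk.corpus import stopwords
import string

stopwords_english = stopwords.words('english')   # list of English stop words
print('Number of English stop words:', len(stopwords_english))
print('First ten stop words:', stopwords_english[:10])
print('Punctuation characters:', string.punctuation)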

Remove rubbish text that is not helpful for the sentiment prediction task on the Twitter dataset, such as hashtags, retweet marks, and hyperlinks:

# tweet holds one sample tweet string from the dataset
# (index 2277 is assumed here; any tweet can be used)
tweet = all_positive_tweets[2277]

print('\033[92m' + tweet)
print('\033[94m')

# remove old style retweet text "RT"
tweet2 = re.sub(r'^RT[\s]+', '', tweet)

# remove hyperlinks
tweet2 = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet2)

# remove hashtags
# only removing the hash # sign from the word
tweet2 = re.sub(r'#', '', tweet2)

print(tweet2)

Run result:

RT @sd My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i

@sd My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off…

To tokenize means to split the string into individual words, without blanks or tabs. In the same step, we also convert each word to lower case. The tokenize module from NLTK lets us do this easily:

print()
print('\033[92m' + tweet2)
print('\033[94m')

# instantiate tokenizer class
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                           reduce_len=True)
# strip_handles=True: usernames starting with @ are removed. This is usually helpful in
# tasks such as sentiment analysis, because a handle rarely carries sentiment and is
# mostly noise.
# reduce_len=True: runs of 3 or more repeated characters are shortened to length 3, e.g.
# "coooool" becomes "coool" and "loooooove" becomes "loove". This tames emphatic,
# colloquial spellings such as "happyyyyy", keeps the vocabulary from blowing up, and
# helps later normalization.

# tokenize tweets
tweet_tokens = tokenizer.tokenize(tweet2)

print()
print('Tokenized string:')
print(tweet_tokens)

Result:

Original string: My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off…

Tokenized string:
['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']
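
To see what strip_handles and reduce_len actually do, here is a small check on a made-up string (the handle and the stretched words are invented for illustration):

from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
demo = "@some_user this movie is soooooooo goooood happyyyyy :)"
print(tokenizer.tokenize(demo))
# roughly: ['this', 'movie', 'is', 'sooo', 'goood', 'happyyy', ':)']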

Organizing notes this way is not efficient; from now on I will only write down the main ideas and key details. The full code is worth typing out once yourself.

import re
import string
import numpy as np

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer


def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and      # remove stopwords
                word not in string.punctuation):   # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean
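
A quick sanity check of process_tweet on the sunflower tweet from earlier (index 2277 is an assumption; any tweet works):

sample_tweet = all_positive_tweets[2277]   # assumed index of the sunflower example
print('Original: ', sample_tweet)
print('Processed:', process_tweet(sample_tweet))
# roughly: ['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']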

Building and Visualizing word frequencies

Three equivalent ways to implement build_freqs follow; the last version includes explanatory comments:

def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
            frequency
    """
    yslist = np.squeeze(ys).tolist()
    freqs = {}
    for tweet, senti_label in zip(tweets, yslist):
        for word in process_tweet(tweet):
            pair = (word, senti_label)
            freqs[pair] = freqs.get(pair, 0) + 1

    return freqs
def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
            frequency
    """
    freqs = {}
    yslist = np.squeeze(ys).tolist()
    for tweet, y in zip(tweets, yslist):   # iterate over the squeezed labels, not the raw array
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1
    return freqs
def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
            frequency
    """
    # Convert np array to list since zip needs an iterable.
    # The squeeze is necessary or the list ends up with one element.
    # Also note that this is just a NOP if ys is already a list.
    yslist = np.squeeze(ys).tolist()

    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1

    return freqs
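
The word-count table below looks words up in freqs, so the dictionary has to be built first. A minimal sketch of that step, assuming the whole 10000-tweet dataset is used with labels 1 (positive) and 0 (negative):

# concatenate the positive and negative tweet lists
tweets = all_positive_tweets + all_negative_tweets

# label vector: 1 for each positive tweet, 0 for each negative tweet
labels = np.append(np.ones(len(all_positive_tweets)),
                   np.zeros(len(all_negative_tweets)))

# build the (word, sentiment) -> count dictionary
freqs = build_freqs(tweets, labels)
print('Number of (word, sentiment) pairs:', len(freqs))

With freqs available, we can tabulate positive and negative counts for a hand-picked set of words: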
# select some words to appear in the report. we will assume that each word is unique (i.e. no duplicates)
keys = ['happi', 'merri', 'nice', 'good', 'bad', 'sad', 'mad', 'best', 'pretti',
        '❤', ':)', ':(', '😒', '😬', '😄', '😍', '♛',
        'song', 'idea', 'power', 'play', 'magnific']

# list representing our table of word counts.
# each element consists of a sublist with this pattern: [<word>, <positive_count>, <negative_count>]
data = []

# loop through our selected words
for word in keys:

    # initialize positive and negative counts
    pos = 0
    neg = 0

    # retrieve number of positive counts
    if (word, 1) in freqs:
        pos = freqs[(word, 1)]

    # retrieve number of negative counts
    if (word, 0) in freqs:
        neg = freqs[(word, 0)]

    # append the word counts to the table
    data.append([word, pos, neg])

data

fig, ax = plt.subplots(figsize = (8, 8))

# convert positive raw counts to logarithmic scale. we add 1 to avoid log(0)
x = np.log([x[1] + 1 for x in data])

# do the same for the negative counts
y = np.log([x[2] + 1 for x in data])

# Plot a dot for each pair of words
ax.scatter(x, y)

# assign axis labels
plt.xlabel("Log Positive count")
plt.ylabel("Log Negative count")

# Add the word as the label at the same position as you added the points just before
for i in range(0, len(data)):
    ax.annotate(data[i][0], (x[i], y[i]), fontsize=12)

ax.plot([0, 9], [0, 9], color='red')  # Plot the red line that divides the 2 areas.
plt.show()

Visualizing tweets and the Logistic Regression model


  1. Learned how to compute the decision boundary: using the logistic (sigmoid) function z(x) as the classifier, z >= 0.5 is assigned to the positive class and z < 0.5 to the negative class. The sigmoid equals 0.5 exactly when its argument is 0, i.e. when $\theta \cdot x = 0$, so once $\theta$ is known the decision boundary follows directly (see the sketch after this list).

  2. The gradient of the classification function points in the direction of the decision arrow (perpendicular to the decision boundary).
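
As a concrete sketch (the θ values below are made up for illustration): with a bias term x0 = 1 and two features x1, x2, the boundary $\theta \cdot x = 0$ can be solved as x2 = -(θ0 + θ1·x1)/θ2 and plotted as a line.

import numpy as np
import matplotlib.pyplot as plt

# made-up parameters: theta = [theta0 (bias), theta1, theta2]
theta = np.array([0.5, -1.0, 2.0])

# on the decision boundary: theta0 + theta1*x1 + theta2*x2 = 0
# solve for x2:             x2 = -(theta0 + theta1*x1) / theta2
x1 = np.linspace(-5, 5, 100)
x2 = -(theta[0] + theta[1] * x1) / theta[2]

plt.plot(x1, x2, color='red')   # the decision boundary separating the two classes
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()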
