在本書中,我們經(jīng)常會使用表示為單詞、字符或單詞序列的文本數(shù)據(jù)。首先,我們需要一些基本工具來將原始文本轉(zhuǎn)換為適當形式的序列。典型的預(yù)處理流水線執(zhí)行以下步驟:
-
將文本作為字符串加載到內(nèi)存中。
-
將字符串拆分為標記(例如,單詞或字符)。
-
構(gòu)建一個詞匯詞典,將每個詞匯元素與一個數(shù)字索引相關(guān)聯(lián)。
-
將文本轉(zhuǎn)換為數(shù)字索引序列。
import collections
import random
import re
import torch
from d2l import torch as d2l
import collections
import random
import re
import tensorflow as tf
from d2l import tensorflow as d2l
9.2.1. 讀取數(shù)據(jù)集
在這里,我們將使用 HG Wells 的The Time Machine,這是一本 30000 多字的書。雖然實際應(yīng)用程序通常會涉及大得多的數(shù)據(jù)集,但這足以演示預(yù)處理管道。以下_download
方法將原始文本讀入字符串。
class TimeMachine(d2l.DataModule): #@save
"""The Time Machine dataset."""
def _download(self):
fname = d2l.download(d2l.DATA_URL + 'timemachine.txt', self.root,
'090b5e7e70c295757f55df93cb0a180b9691891a')
with open(fname) as f:
return f.read()
data = TimeMachine()
raw_text = data._download()
raw_text[:60]
'時間機器,HG Wells [1898]nnnnnInnnThe Time Tra'
class TimeMachine(d2l.DataModule): #@save
"""The Time Machine dataset."""
def _download(self):
fname = d2l.download(d2l.DATA_URL + 'timemachine.txt', self.root,
'090b5e7e70c295757f55df93cb0a180b9691891a')
with open(fname) as f:
return f.read()
data = TimeMachine()
raw_text = data._download()
raw_text[:60]
Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...
'The Time Machine, by H. G. Wells [1898]nnnnnInnnThe Time Tra'
class TimeMachine(d2l.DataModule): #@save
"""The Time Machine dataset."""
def _download(self):
fname = d2l.download(d2l.DATA_URL + 'timemachine.txt', self.root,
'090b5e7e70c295757f55df93cb0a180b9691891a')
with open(fname) as f:
return f.read()
data = TimeMachine()
raw_text = data._download()
raw_text[:60]
'The Time Machine, by H. G. Wells [1898]nnnnnInnnThe Time Tra'
class TimeMachine(d2l.DataModule): #@save
"""The Time Machine dataset."""
def _download(self):
fname = d2l.download(d2l.DATA_URL + 'timemachine.txt', self.root,
'090b5e7e70c295757f55df93cb0a180b9691891a')
with open(fname) as f:
return f.read()
data = TimeMachine()
raw_text = data._download()
raw_text[:60]
'The Time Machine, by H. G. Wells [1898]nnnnnInnnThe Time Tra'
為簡單起見,我們在預(yù)處理原始文本時忽略標點符號和大寫字母。
@d2l.add_to_class(TimeMachine) #@save
def _preprocess(self, text):
return re.sub('[^A-Za-z]+', ' ', text).lower()
text = data._preprocess(raw_text)
text[:60]
'the time machine by h g wells i the time traveller for so it'
'the time machine by h g wells i the time traveller for so it'
'the time machine by h g wells i the time traveller for so it'
9.2.2. 代幣化
標記是文本的原子(不可分割)單元。每個時間步對應(yīng) 1 個 token,但究竟什么是 token 是一種設(shè)計選擇。例如,我們可以將句子“Baby needs a new pair of shoes”表示為一個包含 7 個單詞的序列,其中所有單詞的集合包含一個很大的詞匯表(通常是數(shù)萬或數(shù)十萬個單詞)。或者我們將同一個句子表示為更長的 30 個字符序列,使用更小的詞匯表(只有 256 個不同的 ASCII 字符)。下面,我們將預(yù)處理后的文本標記為一系列字符。
't,h,e, ,t,i,m,e, ,m,a,c,h,i,n,e, ,b,y, ,h, ,g, ,w,e,l,l,s, '
't,h,e, ,t,i,m,e, ,m,a,c,h,i,n,e, ,b,y, ,h, ,g, ,w,e,l,l,s, '
't,h,e, ,t,i,m,e, ,m,a,c,h,i,n,e, ,b,y, ,h, ,g, ,w,e,l,l,s, '
9.2.3. 詞匯
這些標記仍然是字符串。然而,我們模型的輸入最終必須由數(shù)值輸入組成。接下來,我們介紹一個用于構(gòu)建詞匯表的類,即,將每個不同的標記值與唯一索引相關(guān)聯(lián)的對象。首先,我們確定訓練語料庫中的唯一標記集。然后我們?yōu)槊總€唯一標記分配一個數(shù)字索引。為方便起見,通常會刪除不常用的詞匯元素。Whenever we encounter a token at training or test time that had not been previously seen or was dropped from the vocabulary, we represent it by a special “” token, signifying that this is an unknown value.
class Vocab: #@save
"""Vocabulary for text."""
def __init__(self, tokens=[], min_freq=0, reserved_tokens=[]):
# Flatten a 2D list if needed
if tokens and isinstance(tokens[0], list):
tokens = [token for line in tokens for token in line]
# Count token frequencies
counter = collections.Counter(tokens)
self.token_freqs = sorted(counter.items(), key=lambda x: x[1],
reverse=True)
# The list of unique tokens
self.idx_to_token = list(sorted(set([''] + reserved_tokens + [
token for token, freq in self.token_freqs if freq >= min_freq])))
self.token_to_idx = {token: idx
for idx, token in enumerate(self.idx_to_token)}
def __len__(self):
return len(self.idx_to_token)
def __getitem__(self, tokens):
if not isinstance(tokens, (list, tuple)):
return self.token_to_idx.get(tokens,
評論