matchzoo.preprocessor package
Submodules
matchzoo.preprocessor.process_units module
Matchzoo toolkit for text pre-processing.
-
class matchzoo.preprocessor.process_units.DigitRemovalUnit
Bases: matchzoo.preprocessor.process_units.ProcessorUnit
Process unit to remove digits.
-
transform(tokens)
Remove digits from a list of tokens.
Parameters: tokens (list) -- list of tokens to be filtered.
Returns: list of tokens without digits.
Return type: list
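The description above does not say whether whole numeric tokens are dropped or digit characters are stripped from inside tokens. As a minimal sketch assuming the former, the standalone function below filters out tokens made up entirely of digits. The name remove_digit_tokens is hypothetical and is not part of MatchZoo.

```python
def remove_digit_tokens(tokens):
    """Drop tokens that consist only of digits (assumed behaviour)."""
    # str.isdigit() is True only for non-empty strings made up of digits.
    return [token for token in tokens if not token.isdigit()]


filtered = remove_digit_tokens(["room", "42", "b2b", "2019"])
print(filtered)
```

Under this assumption a mixed token such as "b2b" is kept, because it is not purely numeric.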
-
-
class matchzoo.preprocessor.process_units.LemmatizationUnit
Bases: matchzoo.preprocessor.process_units.ProcessorUnit
Process unit for token lemmatization.
-
transform(tokens)
Lemmatize a sequence of tokens.
Parameters: tokens (list) -- list of tokens to be lemmatized.
Returns: list of lemmatized tokens.
Return type: list
-
-
class matchzoo.preprocessor.process_units.LowercaseUnit
Bases: matchzoo.preprocessor.process_units.ProcessorUnit
Process unit to convert text to lower case.
-
transform(tokens)
Convert a list of tokens to lower case.
Parameters: tokens (list) -- list of tokens.
Returns: lower-cased list of tokens.
Return type: list
-
-
class matchzoo.preprocessor.process_units.NgramLetterUnit
Bases: matchzoo.preprocessor.process_units.StatefulProcessorUnit
Process unit for n-letter generation.
Triletter is used in DSSMModel. This processor is expected to execute after Vocab has been created.
The returned input_dim is the dimensionality of DSSMModel.
-
fit(tokens, ngram=3)
Fit parameters (shape of the word hashing layer) for DSSM.
Parameters:
- tokens (list) -- list of tokens to be fitted.
- ngram (int) -- n-gram size; defaults to 3 (tri-letter).
-
transform(tokens, ngram=3)
Transform tokens into tri-letters.
For example, word should be represented as #wo, wor, ord and rd#.
Parameters:
- tokens (list) -- list of tokens to be transformed.
- ngram (int) -- n-gram size; defaults to 3 (tri-letter).
Returns: set of tri-letters, dependent on ngram.
Return type: list
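The word example above can be reproduced with a short standalone sketch. It assumes each token is padded with a single # on both sides and that a window of width ngram slides over the padded string. The function names letter_ngrams and transform_tokens are hypothetical, and flattening the per-token results into one list is also an assumption.

```python
def letter_ngrams(token, ngram=3):
    """Return the letter n-grams of one token, padded with '#' markers."""
    padded = "#" + token + "#"
    # Slide a window of width `ngram` over the padded token.
    return [padded[i:i + ngram] for i in range(len(padded) - ngram + 1)]


def transform_tokens(tokens, ngram=3):
    """Concatenate the letter n-grams of every token (assumed flattening)."""
    result = []
    for token in tokens:
        result.extend(letter_ngrams(token, ngram))
    return result


print(letter_ngrams("word"))
```

For "word" the padded string is "#word#", which yields exactly the four tri-letters listed in the description.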
-
-
class matchzoo.preprocessor.process_units.ProcessorUnit
Bases: object
Process unit that does not persist state (i.e. does not need fit).
-
transform(input)
Abstract base method; must be implemented in a subclass.
-
-
class matchzoo.preprocessor.process_units.PuncRemovalUnit
Bases: matchzoo.preprocessor.process_units.ProcessorUnit
Process unit to remove punctuation.
-
transform(tokens)
Remove punctuation from a list of tokens.
Parameters: tokens (list) -- list of tokens.
Returns: tokens without punctuation.
Return type: list
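As a rough illustration of punctuation removal at the token level, the sketch below drops every token that is made up only of ASCII punctuation characters. This is an assumption about the behaviour, not the library's code, and remove_punctuation_tokens is a hypothetical name.

```python
import string

# ASCII punctuation characters from the standard library.
PUNCTUATION = set(string.punctuation)


def remove_punctuation_tokens(tokens):
    """Drop tokens that contain nothing but punctuation characters."""
    return [
        token for token in tokens
        if not (token and all(ch in PUNCTUATION for ch in token))
    ]


print(remove_punctuation_tokens(["hello", ",", "world", "!", "..."]))
```

A token such as "don't" survives in this sketch because it also contains letters.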
-
-
class matchzoo.preprocessor.process_units.StatefulProcessorUnit
Bases: matchzoo.preprocessor.process_units.ProcessorUnit
Process unit that persists state (i.e. needs fit).
-
fit(input)
Abstract base method; must be implemented in a subclass.
-
state
Get current state.
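The split between stateless and stateful units can be sketched with Python's abc module. The classes below are standalone stand-ins that mirror the documented interface (transform, fit, state); they are not the MatchZoo classes, and storing the state in a dict is an assumption. LongestTokenUnit is a toy subclass invented for illustration.

```python
import abc


class SketchProcessorUnit(abc.ABC):
    """Stateless unit: only transform is required."""

    @abc.abstractmethod
    def transform(self, input):
        """Must be implemented in a subclass."""


class SketchStatefulProcessorUnit(SketchProcessorUnit):
    """Stateful unit: fit fills in state that transform can then use."""

    def __init__(self):
        self._state = {}  # assumed container for fitted parameters

    @property
    def state(self):
        """Get current state."""
        return self._state

    @abc.abstractmethod
    def fit(self, input):
        """Must be implemented in a subclass."""


class LongestTokenUnit(SketchStatefulProcessorUnit):
    """Toy subclass: remembers the longest token length seen during fit."""

    def fit(self, tokens):
        self._state["max_len"] = max((len(t) for t in tokens), default=0)

    def transform(self, tokens):
        # Keep only tokens no longer than the fitted maximum.
        return [t for t in tokens if len(t) <= self._state["max_len"]]


unit = LongestTokenUnit()
unit.fit(["a", "abc"])
print(unit.state, unit.transform(["ab", "abcd"]))
```

Because both base methods are abstract, instantiating either sketch base class directly raises TypeError, which matches the "must be implemented in a subclass" contract.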
-
-
class matchzoo.preprocessor.process_units.StemmingUnit(stemmer='porter')
Bases: matchzoo.preprocessor.process_units.ProcessorUnit
Process unit for token stemming.
-
transform(tokens)
Reduce inflected words to their word stem, base or root form.
Parameters:
- tokens (list) -- list of strings to be stemmed.
- stemmer -- stemmer to use, porter or lancaster.
Raises: ValueError -- stemmer type should be porter or lancaster.
Returns: list of stemmed tokens.
Return type: list
-
-
class matchzoo.preprocessor.process_units.StopRemovalUnit(lang='en')
Bases: matchzoo.preprocessor.process_units.ProcessorUnit
Process unit to remove stop words.
-
get_stopwords()
Get stopwords based on language.
Parameters: lang -- language code.
Returns: list of stop words.
Return type: list
-
transform(tokens)
Remove stopwords from a list of tokens.
Parameters:
- tokens (list) -- list of tokens.
- lang -- language code for stopwords.
Returns: list of tokens without stopwords.
Return type: list
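Stop word removal amounts to a membership test against a per-language list. The sketch below is a standalone illustration: the tiny STOPWORDS table and the function names are hypothetical, and the source of the real stop lists is not stated in this reference.

```python
# Tiny illustrative stop list; a real list would be far longer.
STOPWORDS = {"en": {"a", "an", "the", "of", "is", "and"}}


def get_stopwords(lang="en"):
    """Return the stop words for a language code (KeyError if unknown)."""
    return STOPWORDS[lang]


def remove_stopwords(tokens, lang="en"):
    """Keep only tokens that are not in the stop list for `lang`."""
    stop_list = get_stopwords(lang)
    return [token for token in tokens if token not in stop_list]


print(remove_stopwords(["the", "cat", "is", "black"]))
```

The comparison in this sketch is case-sensitive, so "The" would not be removed unless the tokens are lower-cased first.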
-
-
class matchzoo.preprocessor.process_units.TokenizeUnit
Bases: matchzoo.preprocessor.process_units.ProcessorUnit
Process unit for text tokenization.
-
transform(input)
Process input data from raw text to a list of tokens.
Parameters: input (str) -- raw textual input.
Returns: list of tokens.
Return type: list
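The tokenizer behind this unit is not named in this reference. Purely as a sketch of the str-to-list contract, the function below splits text into word tokens and single punctuation tokens with a regular expression. The pattern and the name tokenize are assumptions and will not match the library's output in every case.

```python
import re

# Runs of word characters, or any single non-space, non-word character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split raw text into a list of word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Hello, world!"))
```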
-
-
class matchzoo.preprocessor.process_units.VocabularyUnit
Bases: matchzoo.preprocessor.process_units.StatefulProcessorUnit
Vocabulary class.
Examples
>>> vocab = VocabularyUnit()
>>> vocab.fit(['A', 'B', 'C', 'D', 'E'])
>>> term_index = vocab.state['term_index']
>>> term_index
{'E': 1, 'C': 2, 'D': 3, 'A': 4, 'B': 5}
>>> index_term = vocab.state['index_term']
>>> index_term
{1: 'C', 2: 'A', 3: 'E', 4: 'B', 5: 'D'}

>>> term_index['out-of-vocabulary-term']
0
>>> index_term[0]
''
>>> index_term[42]
Traceback (most recent call last):
...
KeyError: 42

>>> a_index = term_index['A']
>>> c_index = term_index['C']
>>> vocab.transform(['C', 'A', 'C']) == [c_index, a_index, c_index]
True
>>> vocab.transform(['C', 'A', 'OOV']) == [c_index, a_index, 0]
True

>>> indices = vocab.transform('ABCDDZZZ')
>>> ''.join(vocab.state['index_term'][i] for i in indices)
'ABCDD'
-
class IndexTerm
Bases: dict
Map index to term.
-
class TermIndex
Bases: dict
Map term to index.
-
transform(tokens)
Transform a list of tokens to corresponding indices.
Return type: list
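The out-of-vocabulary behaviour in the examples (unknown terms map to 0, index 0 maps to an empty string, other unknown indices raise KeyError) can be obtained from dict subclasses that define __missing__. The sketch below is a standalone illustration with hypothetical names (SketchTermIndex, SketchIndexTerm, build_vocab, to_indices). It assigns indices in sorted order for reproducibility, which is an assumption; the examples above show a different ordering.

```python
class SketchTermIndex(dict):
    """Map term to index; unknown terms map to 0."""

    def __missing__(self, key):
        return 0


class SketchIndexTerm(dict):
    """Map index to term; index 0 maps to '' and other gaps raise."""

    def __missing__(self, key):
        if key == 0:
            return ""
        raise KeyError(key)


def build_vocab(tokens):
    """Assign indices starting at 1, leaving 0 for out-of-vocabulary."""
    term_index = SketchTermIndex()
    index_term = SketchIndexTerm()
    for index, term in enumerate(sorted(set(tokens)), start=1):
        term_index[term] = index
        index_term[index] = term
    return term_index, index_term


def to_indices(tokens, term_index):
    """Look up each token; missing tokens become 0 via __missing__."""
    return [term_index[token] for token in tokens]


term_index, index_term = build_vocab(["A", "B", "C", "D", "E"])
indices = to_indices("ABCDDZZZ", term_index)
print("".join(index_term[i] for i in indices))
```

Mapping index 0 to the empty string is what makes the round trip of 'ABCDDZZZ' come back as 'ABCDD': the three unknown Z characters contribute nothing to the joined string.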
-
class