matchzoo.preprocessor package

Submodules

matchzoo.preprocessor.process_units module

Matchzoo toolkit for text pre-processing.

class matchzoo.preprocessor.process_units.DigitRemovalUnit

Bases: matchzoo.preprocessor.process_units.ProcessorUnit

Process unit to remove digits.

transform(tokens)

Remove digits from list of tokens.

Parameters: tokens (list) -- list of tokens to be filtered.
Returns: list of tokens without digits.
Return type: list
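A minimal sketch of the digit-filtering logic this unit describes (not MatchZoo's actual implementation), assuming the unit drops tokens that consist entirely of digits:

```python
def remove_digits(tokens):
    """Drop tokens that consist entirely of digits."""
    return [t for t in tokens if not t.isdigit()]

print(remove_digits(['call', '911', 'now']))  # ['call', 'now']
```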
class matchzoo.preprocessor.process_units.LemmatizationUnit

Bases: matchzoo.preprocessor.process_units.ProcessorUnit

Process unit for token lemmatization.

transform(tokens)

Lemmatize a sequence of tokens.

Parameters: tokens (list) -- list of tokens to be lemmatized.
Returns: list of lemmatized tokens.
Return type: list
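A sketch of the transform's shape, with a toy lemma table standing in for a real lemmatizer (a production unit would typically delegate to something like NLTK's WordNetLemmatizer; the table here is purely illustrative):

```python
# Toy lookup table standing in for a real lemmatizer; illustrative only.
LEMMAS = {'running': 'run', 'ate': 'eat', 'better': 'good'}

def lemmatize(tokens):
    """Map each token to its lemma, leaving unknown tokens unchanged."""
    return [LEMMAS.get(t, t) for t in tokens]

print(lemmatize(['running', 'ate', 'home']))  # ['run', 'eat', 'home']
```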
class matchzoo.preprocessor.process_units.LowercaseUnit

Bases: matchzoo.preprocessor.process_units.ProcessorUnit

Process unit for text lower case.

transform(tokens)

Convert list of tokens to lower case.

Parameters: tokens (list) -- list of tokens.
Returns: lower-cased list of tokens.
Return type: list
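The lower-casing transform is a one-liner; a minimal sketch of the behaviour described above:

```python
def lowercase(tokens):
    """Lower-case every token in the list."""
    return [t.lower() for t in tokens]

print(lowercase(['Hello', 'WORLD']))  # ['hello', 'world']
```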
class matchzoo.preprocessor.process_units.NgramLetterUnit

Bases: matchzoo.preprocessor.process_units.StatefulProcessorUnit

Process unit for n-letter generation.

Tri-letters are used in DSSMModel. This processor is expected to execute after the vocabulary has been created.

The returned input_dim is the input dimensionality of DSSMModel.

fit(tokens, ngram=3)

Fit parameters (the shape of the word hashing layer) for DSSM.

Parameters:
  • tokens (list) -- list of tokens to be fitted.
  • ngram (int) -- By default use 3-gram (tri-letter).
transform(tokens, ngram=3)

Transform tokens into tri-letters.

For example, the token "word" is represented as #wo, wor, ord and rd#.

Parameters:
  • tokens (list) -- list of tokens to be transformed.
  • ngram (int) -- By default use 3-gram (tri-letter).
Returns: set of tri-letters, dependent on ngram.
Return type: list
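The tri-letter expansion described above can be sketched as follows, assuming each token is padded with # boundary markers before sliding an n-gram window (this mirrors the #wo/wor/ord/rd# example; the real unit may additionally deduplicate):

```python
def ngram_letters(tokens, ngram=3):
    """Expand each token into letter n-grams, padding with '#' markers."""
    result = []
    for token in tokens:
        padded = '#' + token + '#'
        # Slide an n-gram window across the padded token.
        for i in range(len(padded) - ngram + 1):
            result.append(padded[i:i + ngram])
    return result

print(ngram_letters(['word']))  # ['#wo', 'wor', 'ord', 'rd#']
```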

class matchzoo.preprocessor.process_units.ProcessorUnit

Bases: object

Process unit that does not preserve state (i.e. does not need fit).

transform(input)

Abstract base method; must be implemented in subclasses.
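A minimal sketch of how such a stateless unit might be structured and subclassed. ReverseUnit is a hypothetical example for illustration, not part of MatchZoo:

```python
import abc

class ProcessorUnit(metaclass=abc.ABCMeta):
    """Stateless process unit: transform only, no fit step."""

    @abc.abstractmethod
    def transform(self, input_):
        """Must be implemented in subclasses."""

class ReverseUnit(ProcessorUnit):
    """Hypothetical subclass that reverses token order (illustration only)."""

    def transform(self, input_):
        return list(reversed(input_))

print(ReverseUnit().transform(['a', 'b', 'c']))  # ['c', 'b', 'a']
```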

class matchzoo.preprocessor.process_units.PuncRemovalUnit

Bases: matchzoo.preprocessor.process_units.ProcessorUnit

Process unit to remove punctuation.

transform(tokens)

Remove punctuation from list of tokens.

Parameters: tokens (list) -- list of tokens.
Returns: tokens without punctuation.
Return type: list
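One plausible reading of this transform, sketched below: drop any token made up entirely of punctuation characters (the actual unit may instead strip punctuation inside tokens):

```python
import string

PUNC = set(string.punctuation)

def remove_punc(tokens):
    """Drop tokens that consist entirely of punctuation characters."""
    return [t for t in tokens if not all(ch in PUNC for ch in t)]

print(remove_punc(['hello', ',', 'world', '!']))  # ['hello', 'world']
```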
class matchzoo.preprocessor.process_units.StatefulProcessorUnit

Bases: matchzoo.preprocessor.process_units.ProcessorUnit

Process unit that preserves state (i.e. needs fit).

fit(input)

Abstract base method; must be implemented in subclasses.

state

Get current state.
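A minimal sketch of the stateful pattern: fit() stores learned parameters, state exposes them, and transform() uses them. MaxLenUnit is a hypothetical illustration, not part of MatchZoo:

```python
class StatefulProcessorUnit:
    """Process unit that keeps state produced by fit()."""

    def __init__(self):
        self._state = {}

    @property
    def state(self):
        """Get current state."""
        return self._state

class MaxLenUnit(StatefulProcessorUnit):
    """Hypothetical subclass: record the longest token during fit,
    then truncate tokens to that length in transform (illustration only)."""

    def fit(self, tokens):
        self._state['max_len'] = max(len(t) for t in tokens)

    def transform(self, tokens):
        return [t[:self._state['max_len']] for t in tokens]
```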

class matchzoo.preprocessor.process_units.StemmingUnit(stemmer='porter')

Bases: matchzoo.preprocessor.process_units.ProcessorUnit

Process unit for token stemming.

transform(tokens)

Reduce inflected words to their word stem, base or root form.

Parameters:
  • tokens (list) -- list of strings to be stemmed.
  • stemmer -- stemmer to use, porter or lancaster.
Raises: ValueError -- stemmer type should be porter or lancaster.
Returns: stemmed tokens.
Return type: list
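A sketch of the stemmer-selection and ValueError behaviour described above. The toy suffix stripper is a stand-in for real stemmers (a production unit would delegate to e.g. NLTK's PorterStemmer or LancasterStemmer):

```python
def stem(tokens, stemmer='porter'):
    """Toy suffix stripper illustrating the stemmer-selection pattern.

    Real stemming would delegate to an actual Porter/Lancaster stemmer;
    only the validation logic mirrors the documented unit.
    """
    if stemmer not in ('porter', 'lancaster'):
        raise ValueError('stemmer type should be porter or lancaster.')
    suffixes = ('ing', 'ed', 's')
    out = []
    for t in tokens:
        for suf in suffixes:
            # Strip a suffix only when a plausible stem remains.
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[:-len(suf)]
                break
        out.append(t)
    return out

print(stem(['jumping', 'cats']))  # ['jump', 'cat']
```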

class matchzoo.preprocessor.process_units.StopRemovalUnit(lang='en')

Bases: matchzoo.preprocessor.process_units.ProcessorUnit

Process unit to remove stop words.

get_stopwords()

Get stopwords based on language.

Parameters: lang -- language code.
Returns: list of stop words.
Return type: list
transform(tokens)

Remove stopwords from list of tokenized tokens.

Parameters:
  • tokens (list) -- list of tokenized tokens.
  • lang -- language code for stopwords.
Returns: list of tokenized tokens without stopwords.
Return type: list
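A minimal sketch of stopword filtering. The tiny word list below is illustrative only; the real unit loads a stopword list for the configured language code (commonly from NLTK):

```python
# Tiny illustrative stopword list; a real unit loads one per language code.
STOPWORDS = {'the', 'is', 'a', 'of'}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(['the', 'cat', 'is', 'here']))  # ['cat', 'here']
```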

class matchzoo.preprocessor.process_units.TokenizeUnit

Bases: matchzoo.preprocessor.process_units.ProcessorUnit

Process unit for text tokenization.

transform(input)

Process input data from raw terms to list of tokens.

Parameters: input (str) -- raw textual input.
Returns: tokenized tokens as a list.
Return type: list
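A simple regex-based sketch of string-to-tokens transformation, as a stand-in for a full tokenizer such as nltk.word_tokenize (which the real unit may use):

```python
import re

def tokenize(text):
    """Split raw text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```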
class matchzoo.preprocessor.process_units.VocabularyUnit

Bases: matchzoo.preprocessor.process_units.StatefulProcessorUnit

Vocabulary class.

Examples

>>> vocab = VocabularyUnit()
>>> vocab.fit(['A', 'B', 'C', 'D', 'E'])
>>> term_index = vocab.state['term_index']
>>> term_index  
{'E': 1, 'C': 2, 'D': 3, 'A': 4, 'B': 5}
>>> index_term = vocab.state['index_term']
>>> index_term  
{1: 'E', 2: 'C', 3: 'D', 4: 'A', 5: 'B'}
>>> term_index['out-of-vocabulary-term']
0
>>> index_term[0]
''
>>> index_term[42]
Traceback (most recent call last):
    ...
KeyError: 42
>>> a_index = term_index['A']
>>> c_index = term_index['C']
>>> vocab.transform(['C', 'A', 'C']) == [c_index, a_index, c_index]
True
>>> vocab.transform(['C', 'A', 'OOV']) == [c_index, a_index, 0]
True
>>> indices = vocab.transform('ABCDDZZZ')
>>> ''.join(vocab.state['index_term'][i] for i in indices)
'ABCDD'
class IndexTerm

Bases: dict

Map index to term.

class TermIndex

Bases: dict

Map term to index.

fit(tokens)

Build a TermIndex and an IndexTerm.

transform(tokens)

Transform a list of tokens to corresponding indices.

Return type: list
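A minimal sketch of the fit/transform behaviour the doctest above shows: index 0 is reserved for out-of-vocabulary terms, and a dict subclass with __missing__ stands in for the documented TermIndex (the OOV fallback is an assumption inferred from the doctest, not confirmed MatchZoo internals):

```python
class TermIndex(dict):
    """Map term to index; unknown terms fall back to 0 (assumed OOV index)."""

    def __missing__(self, key):
        return 0

class VocabularyUnit:
    """Minimal vocabulary sketch; illustrative only."""

    def __init__(self):
        self.state = {}

    def fit(self, tokens):
        # Assign indices 1..N to the unique terms; 0 stays reserved for OOV.
        term_index = TermIndex(
            (t, i) for i, t in enumerate(sorted(set(tokens)), start=1))
        index_term = {i: t for t, i in term_index.items()}
        index_term[0] = ''  # OOV index maps to the empty string
        self.state = {'term_index': term_index, 'index_term': index_term}

    def transform(self, tokens):
        """Map a list of tokens to their indices (0 for unknown terms)."""
        return [self.state['term_index'][t] for t in tokens]
```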

Module contents