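It has been a while since I last wrote anything. I have been busy lately, but writing while learning works better for me; otherwise it is hard to find the energy to summarize things afterwards.

In this post I will first cover the basic usage of spaCy, and then try to show some more practical, engineering-style code that is easy to learn from and reuse.

Installation and usage

Besides installing the spacy library itself, you also need to download the model you plan to use. For English tasks, for example, download en_core_web_sm.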

pip install spacy
python -m spacy download en_core_web_sm

After that, you can import and use it in a Python file like this:

import spacy
nlp = spacy.load("en_core_web_sm")
text = "SpaCy is a popular NLP library. I love it."
doc = nlp(text)

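The doc object keeps a lot of the information produced while processing the text; details follow.

Main features

Tokenization

Here token.text is the token produced by tokenization, token.pos_ is its part-of-speech tag, and token.dep_ is its dependency relation.

Some of these may be hard to grasp at first. I will explain them with examples as they come up.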

for token in doc:
    print(token.text, token.pos_, token.dep_)

Result:

SpaCy PROPN nsubj
is AUX ROOT
a DET det
popular ADJ amod
NLP NOUN compound
library NOUN attr
. PUNCT punct
I PRON nsubj
love VERB ROOT
it PRON dobj
. PUNCT punct

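Some part-of-speech tags used by the English model:

"NOUN": noun
"VERB": verb
"ADJ": adjective
"ADV": adverb
"PRON": pronoun
"DET": determiner (includes articles)
"ADP": adposition (preposition)
"CCONJ": coordinating conjunction (older tag sets used "CONJ")

Some dependency labels:

"nsubj": nominal subject
"dobj": direct object
"prep": prepositional modifier
"amod": adjectival modifier
"conj": conjunct
"ROOT": root of the sentence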
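If a label is not in the lists above, spaCy can describe it for you. The following is a minimal sketch using spacy.explain, which looks labels up in spaCy's built-in glossary; the particular labels chosen here are simply the ones that appear in the example output.

```python
import spacy

# Labels taken from the example output earlier in this post.
labels = ["PROPN", "AUX", "DET", "nsubj", "dobj", "amod", "attr"]

# spacy.explain returns a short description string,
# or None for labels missing from its glossary.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:8} {description}")
```

This needs no model download, so it is a quick way to decode unfamiliar tags while reading parser output.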

Sentence segmentation

sent.text

spaCy can also split a long text into individual sentences:

for sent in doc.sents:
    print(sent.text)

Result:

SpaCy is a popular NLP library.
I love it.
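If sentence boundaries are all you need, loading a full model may be more than necessary. As a hedged alternative that the examples above do not use, the sketch below assumes spaCy v3 and builds a blank English pipeline with the built-in rule-based "sentencizer" component, which splits on sentence-final punctuation instead of using the dependency parse.

```python
import spacy

# A blank English pipeline has a tokenizer but no trained components.
nlp = spacy.blank("en")
# "sentencizer" is spaCy's built-in rule-based sentence splitter.
nlp.add_pipe("sentencizer")

doc = nlp("SpaCy is a popular NLP library. I love it.")

# Collect the sentences as plain strings.
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

Note that a pipeline built this way has no tagger or parser, so token.pos_, token.dep_ and sent.root are not meaningful on its output.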

sent.root

A sent also carries more information about the sentence. For example, sent.root gives the root word of the sentence (usually its main verb or head word):

for sent in doc.sents:
    print(sent.root)

Result:

is
love

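You can go one step further and use sent.root.children to access the direct children of the root word. Each child is again a token object like the ones above, with attributes such as pos_ and dep_, so it can be used to analyze the sentence structure further:
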
for sent in doc.sents:
    for child in sent.root.children:
        print(child.text, child.pos_, child.dep_)

Result:

SpaCy PROPN nsubj
library NOUN attr
. PUNCT punct
I PRON nsubj
it PRON dobj
. PUNCT punct

So, for example, to extract the main subject of each sentence, it is enough to check if child.dep_ == "nsubj".
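The subject check described above can be wrapped in a small helper. This is only a sketch: extract_subjects is a hypothetical name, and to keep the example runnable without downloading a model, the Doc is built by hand (assuming spaCy v3) from the tags and dependencies printed earlier. With the model installed, you would pass nlp(text) instead.

```python
import spacy
from spacy.tokens import Doc


def extract_subjects(doc):
    """Return the text of every nsubj child of each sentence root."""
    subjects = []
    for sent in doc.sents:
        for child in sent.root.children:
            if child.dep_ == "nsubj":
                subjects.append(child.text)
    return subjects


# Hand-built Doc mirroring the parser output shown above (an assumption
# made for this demo; normally these annotations come from the model).
words = ["SpaCy", "is", "a", "popular", "NLP", "library", ".",
         "I", "love", "it", "."]
pos = ["PROPN", "AUX", "DET", "ADJ", "NOUN", "NOUN", "PUNCT",
       "PRON", "VERB", "PRON", "PUNCT"]
deps = ["nsubj", "ROOT", "det", "amod", "compound", "attr", "punct",
        "nsubj", "ROOT", "dobj", "punct"]
# Absolute index of each token's head; a root points at itself.
heads = [1, 1, 5, 5, 5, 1, 1, 8, 8, 8, 8]

vocab = spacy.blank("en").vocab
doc = Doc(vocab, words=words, pos=pos, heads=heads, deps=deps)

subjects = extract_subjects(doc)
print(subjects)
```

In passive sentences the English models label the subject nsubjpass rather than nsubj, so a real pipeline may want to accept both labels.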

sent.noun_chunks

There are other features as well. For example, you can find the noun phrases in a sentence:

for sent in doc.sents:
    for phrase in sent.noun_chunks:
        print(phrase)

Result:

SpaCy
a popular NLP library
I
it
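Each noun chunk is a span, so it also exposes a root token, which tells you what role the phrase plays in the sentence. The sketch below lists each chunk with its head noun, that noun's dependency label, and the word it attaches to. As an assumption for this demo, the Doc is again built by hand (spaCy v3) from the annotations printed earlier, so no model download is required.

```python
import spacy
from spacy.tokens import Doc

# Annotations copied from the example output earlier in this post;
# normally they would come from nlp(text) with a trained model.
words = ["SpaCy", "is", "a", "popular", "NLP", "library", ".",
         "I", "love", "it", "."]
spaces = [True, True, True, True, True, False, True,
          True, True, False, False]
pos = ["PROPN", "AUX", "DET", "ADJ", "NOUN", "NOUN", "PUNCT",
       "PRON", "VERB", "PRON", "PUNCT"]
deps = ["nsubj", "ROOT", "det", "amod", "compound", "attr", "punct",
        "nsubj", "ROOT", "dobj", "punct"]
# Absolute index of each token's head; a root points at itself.
heads = [1, 1, 5, 5, 5, 1, 1, 8, 8, 8, 8]

# The English vocab carries the rules that define noun chunks.
vocab = spacy.blank("en").vocab
doc = Doc(vocab, words=words, spaces=spaces, pos=pos,
          heads=heads, deps=deps)

# (phrase, head noun, role of the head noun, word it attaches to)
rows = [
    (chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)
    for chunk in doc.noun_chunks
]
for row in rows:
    print(row)
```

Filtering these rows by the third field (for example keeping only dobj) is one way to pull out the objects of verbs as whole phrases instead of single tokens.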

Summary

These are the basics I have worked with so far. Pick whichever of them fit your own use case.