Token

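A token (Chinese: 词元) is one of the basic units used to represent text in natural language processing. A token may be a word, a subword, a character, or any other string fragment produced by rule-based segmentation. The process of splitting input text into a sequence of tokens is called tokenization. In modern language models, text is typically tokenized first and then mapped to vocabulary indices and vector representations before it enters the model's computation.^[1]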

Technical principles

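The purpose of tokenization is to convert continuous text into discrete units that a model can process. After segmentation, each token corresponds to one index in the vocabulary, and that index can in turn be mapped to a vector representation for the model's computation.^[1]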
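The text-to-index-to-vector pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular model does it: the whitespace splitter, the `<unk>` entry, the toy corpus, and the randomly initialised embedding table are all hypothetical stand-ins.

```python
# Toy pipeline: text -> tokens -> vocabulary indices -> vectors.
# The splitter, vocabulary and embedding table are illustrative assumptions.
import random

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real tokenizer)."""
    return text.lower().split()

def build_vocab(corpus):
    """Give each distinct token an integer index; index 0 is kept for unknown tokens."""
    vocab = {"<unk>": 0}
    for sentence in corpus:
        for tok in tokenize(sentence):
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(text, vocab):
    """Map each token to its index, falling back to the <unk> index."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]

def make_embeddings(vocab_size, dim, seed=0):
    """Random vectors standing in for a learned embedding table."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocab(corpus)
ids = encode("the cat sat down", vocab)
table = make_embeddings(len(vocab), dim=4)
vectors = [table[i] for i in ids]  # one vector per token
print(ids)
```

In a trained model the embedding table is learned rather than random, but the lookup step has the same shape: one row of the table per token index.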

Common tokenization schemes

Word-level and character-level

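Earlier tokenization methods often used words or characters as the basic unit. Word-level tokenization is semantically more intuitive, but it is prone to oversized vocabularies and to out-of-vocabulary (OOV) words; character-level tokenization mitigates the OOV problem but usually makes the input sequence longer.^[1]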
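The trade-off can be seen on a toy example. The sketch below assumes a whitespace word splitter and a one-character-per-token splitter, with made-up training and test strings; it only illustrates the two effects named above.

```python
# Compare word-level and character-level tokenization on a toy example.
# The training and test strings are illustrative assumptions.

def word_tokens(text):
    return text.split()

def char_tokens(text):
    return list(text)

train = "the cat sat on the mat"
test = "the cats sat"

word_vocab = set(word_tokens(train))
char_vocab = set(char_tokens(train))

# Tokens in the test string that the training vocabulary has never seen.
oov_words = [t for t in word_tokens(test) if t not in word_vocab]
oov_chars = [c for c in char_tokens(test) if c not in char_vocab]

print(oov_words)  # ['cats']
print(oov_chars)  # []
print(len(word_tokens(test)), len(char_tokens(test)))  # 3 12
```

The word-level vocabulary misses "cats" even though "cat" was seen, while the character-level vocabulary covers every symbol at the price of a sequence four times as long.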

Subword tokenization

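Modern natural language processing and language models more commonly use subwords as tokens, striking a balance between vocabulary size and sequence length. Common schemes include:^[1]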

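Byte-pair encoding (BPE): Rico Sennrich et al. applied it to open-vocabulary handling in neural machine translation.^[2]
WordPiece: a subword vocabulary method; BERT adopts this scheme.^[3]
SentencePiece: a language-independent subword segmentation framework that can train subword models directly from raw sentences, without first splitting on whitespace.^[4]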
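As a rough illustration of how a BPE-style vocabulary is learned, the sketch below repeatedly merges the most frequent adjacent symbol pair in a small word-frequency table. It is a simplified sketch, not a reference implementation: the `</w>` end-of-word marker, the tie-breaking rule, and the toy word counts are assumptions made for this example.

```python
# Minimal BPE merge learning on a toy word-frequency table.
# The "</w>" marker, tie-break rule and counts are illustrative assumptions.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of `pair` with one concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Return the learned merge list and the final segmentation of each word."""
    # Start from single characters plus an end-of-word marker.
    words = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        words = merge_pair(best, words)
        merges.append(best)
    return merges, words

word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = learn_bpe(word_freqs, num_merges=3)
print(merges)
print(words)
```

Frequent character sequences such as the shared suffix of "newest" and "widest" end up as single symbols, while rarer words stay split into smaller pieces.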

Role in language models

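Token sequences are the usual input and output representation of modern language models. The tokenization scheme affects vocabulary size, sequence length, and the context length a model can handle.^[1] Research has found that different languages can yield markedly different token counts under the same tokenizer, which in turn affects context utilization, processing latency, and the cost of commercial services.^[5]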
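One mechanism behind such differences can be illustrated with a deliberately simplified tokenizer that emits one token per UTF-8 byte. This is an assumption made for the example only: real tokenizers learn merges on top of bytes or characters, so their actual ratios differ, and the sample greetings below are not drawn from the cited study.

```python
# Token counts per language under a hypothetical pure byte-level tokenizer.
# Real tokenizers learn merges, so actual counts will differ.

def byte_tokens(text):
    """One token per UTF-8 byte (no merges learned)."""
    return list(text.encode("utf-8"))

samples = {
    "English": "hello",
    "Chinese": "你好",
    "Russian": "привет",
}

counts = {}
for lang, text in samples.items():
    n_tokens = len(byte_tokens(text))
    counts[lang] = n_tokens
    # Tokens per character: higher means a fixed context window holds less text.
    print(lang, len(text), n_tokens, n_tokens / len(text))
```

Under this assumption a Latin-script character costs one token, a Cyrillic character two, and a Chinese character three, so the same token budget covers different amounts of text depending on the script.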

Chinese translation and usage

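In recent media and official usage in mainland China, the English term token is often rendered as "词元" (cíyuán). People's Daily has glossed it as "what is commonly called a 词元", describing it as "the smallest data unit for processing text" or "the smallest information unit with which large models process information".^[6]^[7]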

References

[1] Jurafsky, Daniel; Martin, James H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. 3rd ed. Stanford University. 2026 (in English).
[2] Sennrich, Rico; Haddow, Barry; Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics: 1715–1725. 2016. doi:10.18653/v1/P16-1162 (in English).
[3] Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics: 4171–4186. 2019. doi:10.18653/v1/N19-1423 (in English).
[4] Kudo, Taku; Richardson, John. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Brussels, Belgium: Association for Computational Linguistics: 66–71. 2018. doi:10.18653/v1/D18-2012 (in English).
[5] Petrov, Aleksandar; La Malfa, Emanuele; Torr, Philip H. S.; Bibi, Adel. Language Model Tokenizers Introduce Unfairness Between Languages. Advances in Neural Information Processing Systems 36. 2023 (in English).
[6] 王云杉 (Wang Yunshan). 漫谈词元（新知） [A casual talk on tokens (New Knowledge)]. People's Daily (人民日报). 2026-01-28: 05 (in Chinese).
[7] 王云杉 (Wang Yunshan). 我国日均词元调用量突破140万亿 [China's average daily token usage exceeds 140 trillion]. People's Daily (人民日报). 2026-03-24: 08 (in Chinese).
