LlamaVocabMBS class

Type   Topic  Plugin            Version  macOS   Windows  Linux   iOS     Targets
class  Llama  MBS Tools Plugin  25.5     ✅ Yes  ✅ Yes   ✅ Yes  ✅ Yes  All

The class for a vocabulary.

This is an abstract class. You can't create an instance, but you can get one from various plugin functions.

  • 8 properties
    • property bos as Int32
    • property eot as Int32
    • property Handle as Integer
    • property nl as Int32
    • property n_tokens as Integer
    • property pad as Int32
    • property sep as Int32
    • property type as Integer
  • 10 methods
    • method Constructor   Private
    • method Destructor
    • method Detokenize(tokens() as Int32, removeSpecial as boolean, unparseSpecial as Boolean) as String
    • method isControl(Token as Int32) as Boolean
    • method isEOG(Token as Int32) as Boolean
    • method Text(Token as Int32) as String
    • method TokenAttributes(Token as Int32) as Integer
    • method Tokenize(text as String, AddSpecial as Boolean = true, ParseSpecial as Boolean = true) as Int32()
    • method TokenScore(Token as Int32) as Single
    • method TokenToPiece(Token as Int32, special as boolean = true) as String
  • 19 constants
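
The Tokenize and Detokenize signatures listed above can be combined into a round trip. The following is a minimal sketch, not an official example: the function name RoundTrip and its parameters are hypothetical, and the vocab instance is assumed to come from another plugin call, since the constructor is private. The comments on the Boolean parameters assume they behave like the corresponding llama.cpp arguments, which this page does not confirm.

```xojo
// Hypothetical helper: converts text to tokens and back again.
// "vocab" must be obtained from another plugin function; this class cannot be instantiated directly.
Function RoundTrip(vocab As LlamaVocabMBS, source As String) As String
  // AddSpecial = False, ParseSpecial = False
  // (assumed: no BOS/EOS tokens are added and the input is treated as plain text)
  Dim tokens() As Int32 = vocab.Tokenize(source, False, False)

  // removeSpecial = False, unparseSpecial = False
  Return vocab.Detokenize(tokens, False, False)
End Function
```

Depending on the tokenizer, the returned string may differ slightly from the input, for example in leading whitespace, so an exact comparison should not be relied on.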

Constants

Constant   Value  Description
TokenNull  -1     The value used for a null token.

Token Attributes

Constant              Value  Description
TokenAttrByte         32     Byte fallback token (e.g. used for bytes not covered by BPE merges).
TokenAttrControl      8      Control token (e.g. special tokens, separators, directives); such tokens don't map directly to output text.
TokenAttrLstrip       128    Token is to be left-stripped (i.e. leading whitespace removal).
TokenAttrNormal       4      A "normal" token (non-special, non-control); a basic lexical token.
TokenAttrNormalized   64     Token has been normalized (e.g. transformed / canonicalized).
TokenAttrRstrip       256    Token is to be right-stripped (i.e. trailing whitespace removal).
TokenAttrSingleWord   512    Token represents a single word (i.e. atomic word, not subword).
TokenAttrUndefined    0      No attribute set / default / "no classification".
TokenAttrUnknown      1      Token whose status is unknown (e.g. not in vocabulary or fallback).
TokenAttrUnused       2      Token that is present but not used / deprecated / reserved.
TokenAttrUserDefined  16     A user-defined token (e.g. custom token inserted by user / application).
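
Apart from TokenAttrUndefined, the attribute values are distinct powers of two, which suggests that TokenAttributes returns them combined as a bit mask. Under that assumption, a single attribute can be tested with a bitwise And. The helper name IsControlOrUserDefined below is hypothetical and only illustrates the pattern.

```xojo
// Hypothetical helper: True if the token carries the Control or UserDefined attribute.
// Assumes TokenAttributes returns the TokenAttr* constants combined as bit flags.
Function IsControlOrUserDefined(vocab As LlamaVocabMBS, token As Int32) As Boolean
  Dim attrs As Integer = vocab.TokenAttributes(token)

  // Build a mask of the flags of interest (8 Or 16 = 24)
  Dim mask As Integer = LlamaVocabMBS.TokenAttrControl Or LlamaVocabMBS.TokenAttrUserDefined

  // On integers, And is a bitwise operation in Xojo
  Return (attrs And mask) <> 0
End Function
```

For control tokens alone, the isControl method listed above is the more direct choice.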

Vocab Types

Constant    Value  Description
TypeBPE     2      GPT-2 tokenizer based on byte-level BPE
TypeNone    0      For models without vocab
TypePLAMO2  6      PLaMo-2 tokenizer based on Aho-Corasick with dynamic programming
TypeRWKV    5      RWKV tokenizer based on greedy tokenization
TypeSPM     1      LLaMA tokenizer based on byte-level BPE with byte fallback
TypeUGM     4      T5 tokenizer based on Unigram
TypeWPM     3      BERT tokenizer based on WordPiece
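
The type property can be compared against these constants, for example to log which tokenizer a loaded model uses. This is a sketch under the assumption that type returns exactly one of the Type* values above; the function name VocabTypeName is hypothetical.

```xojo
// Hypothetical helper: maps the vocab type to a short label for logging.
Function VocabTypeName(vocab As LlamaVocabMBS) As String
  Select Case vocab.type
  Case LlamaVocabMBS.TypeNone
    Return "none"
  Case LlamaVocabMBS.TypeSPM
    Return "SPM"
  Case LlamaVocabMBS.TypeBPE
    Return "BPE"
  Case LlamaVocabMBS.TypeWPM
    Return "WPM"
  Case LlamaVocabMBS.TypeUGM
    Return "UGM"
  Case LlamaVocabMBS.TypeRWKV
    Return "RWKV"
  Case LlamaVocabMBS.TypePLAMO2
    Return "PLaMo-2"
  Else
    // Values added by later plugin versions end up here
    Return "unknown"
  End Select
End Function
```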

This class has no subclasses.

Release notes

  • Version 26.1
    • Fixed a memory leak in the Tokenize and Detokenize functions of the LlamaVocabMBS class.


The items on this page are in the following plugins: MBS Tools Plugin.

