
Duplicate token types and token documentation #1816

@doerwalter

Description

I've used the following script to get all token types that are defined in any of the lexers:

from pygments import token, lexers

# Instantiate one lexer per registered entry. The instances are never used;
# the point is the side effect: importing each lexer module registers every
# token type it uses as a subtype of Token.
for name, aliases, patterns, mimetypes in lexers.get_all_lexers():
    if aliases:
        lexers.get_lexer_by_name(aliases[0])
    elif patterns:
        lexers.get_lexer_for_filename(patterns[0])
    elif mimetypes:
        lexers.get_lexer_for_mimetype(mimetypes[0])

def list_tokens(t):
    # Walk the token hierarchy depth-first, yielding every registered type.
    yield t
    for st in t.subtypes:
        yield from list_tokens(st)

tokens = list(list_tokens(token.Token))

for t in sorted(tokens):
    print(t)

The output is the following:

Token
Token.Comment
Token.Comment.Directive
Token.Comment.Doc
Token.Comment.Hashbang
Token.Comment.Multi
Token.Comment.Multiline
Token.Comment.Preproc
Token.Comment.PreprocFile
Token.Comment.Single
Token.Comment.Singleline
Token.Comment.SingleLine
Token.Comment.Special
Token.Error
Token.Escape
Token.Generic
Token.Generic.Deleted
Token.Generic.Emph
Token.Generic.Error
Token.Generic.Heading
Token.Generic.Inserted
Token.Generic.Output
Token.Generic.Prompt
Token.Generic.Strong
Token.Generic.Subheading
Token.Generic.Traceback
Token.Generic.Whitespace
Token.Keyword
Token.Keyword.Builtin
Token.Keyword.Constant
Token.Keyword.Control
Token.Keyword.Declaration
Token.Keyword.Keyword
Token.Keyword.Namespace
Token.Keyword.PreProc
Token.Keyword.Pseudo
Token.Keyword.Removed
Token.Keyword.Reserved
Token.Keyword.Token
Token.Keyword.Tokens
Token.Keyword.Type
Token.Keyword.Word
Token.Literal
Token.Literal.Char
Token.Literal.Date
Token.Literal.Number
Token.Literal.Number.Attribute
Token.Literal.Number.Bin
Token.Literal.Number.Dec
Token.Literal.Number.Decimal
Token.Literal.Number.Float
Token.Literal.Number.Hex
Token.Literal.Number.Int
Token.Literal.Number.Integer
Token.Literal.Number.Integer.Long
Token.Literal.Number.Oct
Token.Literal.Number.Octal
Token.Literal.Number.Radix
Token.Literal.Other
Token.Literal.Scalar
Token.Literal.Scalar.Plain
Token.Literal.String
Token.Literal.String.Affix
Token.Literal.String.Atom
Token.Literal.String.Backtick
Token.Literal.String.Boolean
Token.Literal.String.Char
Token.Literal.String.Character
Token.Literal.String.Delimiter
Token.Literal.String.Doc
Token.Literal.String.Double
Token.Literal.String.Escape
Token.Literal.String.Heredoc
Token.Literal.String.Interp
Token.Literal.String.Interpol
Token.Literal.String.Moment
Token.Literal.String.Name
Token.Literal.String.Other
Token.Literal.String.Regex
Token.Literal.String.Single
Token.Literal.String.Symbol
Token.Name
Token.Name.Attribute
Token.Name.Attribute.Variable
Token.Name.Attributes
Token.Name.Builtin
Token.Name.Builtin.Pseudo
Token.Name.Builtin.Type
Token.Name.Builtins
Token.Name.Class
Token.Name.Class.DBS
Token.Name.Class.Start
Token.Name.Classes
Token.Name.Constant
Token.Name.Decorator
Token.Name.Entity
Token.Name.Entity.DBS
Token.Name.Exception
Token.Name.Field
Token.Name.Function
Token.Name.Function.Magic
Token.Name.Keyword
Token.Name.Keyword.Tokens
Token.Name.Label
Token.Name.Namespace
Token.Name.Operator
Token.Name.Other
Token.Name.Other.Member
Token.Name.Property
Token.Name.Pseudo
Token.Name.Quoted
Token.Name.Quoted.Escape
Token.Name.Symbol
Token.Name.Tag
Token.Name.Type
Token.Name.Variable
Token.Name.Variable.Anonymous
Token.Name.Variable.Class
Token.Name.Variable.Global
Token.Name.Variable.Instance
Token.Name.Variable.Magic
Token.Operator
Token.Operator.DBS
Token.Operator.Word
Token.Other
Token.OutPrompt
Token.OutPromptNum
Token.Prompt
Token.PromptNum
Token.Punctuation
Token.Punctuation.Indicator
Token.Text
Token.Text.Symbol
Token.Text.Whitespace

Some of those tokens seem to be duplicates or typos:

  • Token.Comment.SingleLine (used by BSTLexer in lexers/bibtex.py), Token.Comment.Singleline (used by FloScriptLexer in lexers/floscript.py) and Token.Comment.Single (used everywhere else).
  • Token.Comment.Multi (used by CleanLexer in lexers/clean.py) and Token.Comment.Multiline (used everywhere else).
  • Token.Literal.Number.Dec (used by NuSMVLexer in lexers/smv.py) and Token.Literal.Number.Decimal (used everywhere else).
  • Token.Literal.Number.Int (used by CddlLexer in lexers/cddl.py) and Token.Literal.Number.Integer (used everywhere else).
  • Token.Literal.Number.Octal (used by ThingsDBLexer in lexers/thingsdb.py) and Token.Literal.Number.Oct (used everywhere else).

The following are inconclusive:

  • Token.Literal.String.Character (used by UniconLexer and IconLexer in lexers/unicon.py and by DelphiLexer in lexers/pascal.py) and Token.Literal.String.Char (used everywhere else).
  • Token.Literal.String.Interp (used by ColdfusionLexer in lexers/templates.py) and Token.Literal.String.Interpol (used everywhere else). But here "Interp" might mean interpreted rather than interpolated.
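
To double-check where a given token type is actually used, one can scan the uncompiled tokens tables of the lexer classes. A rough sketch of such a helper (it only sees plain (regex, token) rules and misses tokens hidden inside bygroups() or callback rules, and instantiating every lexer is slow):

from pygments import lexers
from pygments.token import String

def lexers_using(target):
    # Yield the names of lexer classes whose token tables mention `target`.
    seen = set()
    for name, aliases, patterns, mimetypes in lexers.get_all_lexers():
        if not aliases:
            continue  # a few entries have no alias and are skipped here
        cls = type(lexers.get_lexer_by_name(aliases[0]))
        if cls.__name__ in seen:
            continue
        seen.add(cls.__name__)
        # `tokens` is the uncompiled rule table of a RegexLexer subclass;
        # plain rules are (regex, tokentype) or (regex, tokentype, newstate).
        for rules in getattr(cls, 'tokens', {}).values():
            if any(isinstance(rule, tuple) and len(rule) >= 2
                   and rule[1] is target for rule in rules):
                yield cls.__name__
                break

for name in lexers_using(String.Character):
    print(name)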

Would it make sense to consolidate those tokens that are clearly typos?
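
If backward compatibility is a concern, the old names would not even have to disappear immediately. As a minimal sketch (assuming the lexers themselves are switched to the canonical types first; this is not an existing Pygments mechanism): token subtypes are created on attribute access, so a plain assignment can turn a misspelled name into an alias for the canonical type:

from pygments.token import Comment, Number

# Hypothetical compatibility shim, not part of Pygments: point the
# misspelled names at the canonical types so that custom styles or
# filters still referencing the old names resolve to the same objects.
Comment.SingleLine = Comment.Single
Comment.Singleline = Comment.Single
Comment.Multi = Comment.Multiline
Number.Dec = Number.Decimal
Number.Int = Number.Integer
Number.Octal = Number.Oct

# Both spellings now denote one and the same token type:
assert Comment.SingleLine is Comment.Single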

Also, it would be really helpful to have documentation of what each token is supposed to represent (for example, what is Token.Operator.DBS?). A __doc__ attribute on each token object would really help.
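
For illustration, a sketch of what that could look like (the descriptions below are invented placeholders, not official definitions): _TokenType instances accept arbitrary attributes, so a docstring could simply be attached to each type:

from pygments.token import Comment, Operator

# Hypothetical sketch, not current Pygments behaviour: token types are
# ordinary instances, so a description can be stored on each one and
# read back via __doc__.
Comment.Single.__doc__ = "A comment that ends at the end of the line."
Operator.Word.__doc__ = "An operator written as a word, e.g. 'and', 'not'."

print(Comment.Single.__doc__)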
