Introduction

What “adding a language” means

A language plugin teaches PWAxcode how to:

  • Detect when a text likely belongs to that language.
  • Tokenize the text into spans (keywords, strings, numbers, comments, etc.) so the editor can highlight it.
  • Auto-indent content based on simple structural rules (e.g., braces).

You register all three via:

PWAxcode.registerLanguage({
  id: 'your-language-id',
  detect: (txt) => 0,                    // return a score 0..1
  tokenize: (src) => [{ text: src }],    // return [{ type?, text }, ...]
  autoIndent: (text, size = 2) => text   // return indented text
});


Once registered, you can:

  • Let PWAxcode auto-detect by setting the language to auto.
  • Or force it explicitly with api.setOption('lang', 'arduino'); (replace with your ID).

A complete, working example: Arduino

Below is a fully working plugin for Arduino-style code (C/C++ syntax with Arduino functions). Use it as a template and adapt the parts marked with comments to your language.

// Register new language - Arduino
// Plugin: Arduino (C/C++-like with Arduino functions)
PWAxcode.registerLanguage({
  id: 'arduino',

  // 2.1 Detect: return a confidence score [0..1]
  detect: (txt) =>
    /void\s+setup\s*\(|void\s+loop\s*\(|#include\s*<Arduino\.h>/.test(txt) ? 1 : 0,

  // 2.2 Tokenize: convert source into [{ type?, text }, ...]
  tokenize: (src) => {
    const tokens = [];
    const push = (type, text) => tokens.push({ type, text });
    const pushTxt = (text) => tokens.push({ text });

    // Language vocabularies (customize per language)
    const KW = new Set(['if','else','for','while','do','switch','case','break','continue','return',
      'const','volatile','static','struct','enum','typedef','sizeof','inline']);
    const TYPE = new Set(['void','bool','char','short','int','long','float','double','unsigned','signed',
      'uint8_t','uint16_t','uint32_t','uint64_t','size_t']);
    const AFN = new Set(['setup','loop','pinMode','digitalWrite','digitalRead','analogWrite','analogRead',
      'delay','millis','micros','attachInterrupt','detachInterrupt','Serial','Serial1','Serial2']);

    let i = 0;
    const n = src.length;

    // Whitespace other than '\n', so every newline becomes its own token
    const isWS = (c) => c !== '\n' && /\s/.test(c);
    const isDigit = (c) => /[0-9]/.test(c);
    const readWhile = (pred) => {
      const j = i;
      while (i < n && pred(src[i])) i++;
      return src.slice(j, i);
    };
    const readNumber = () => {
      const j = i;
      if (src[i] === '-') i++;
      while (i < n && isDigit(src[i])) i++;
      if (src[i] === '.') { i++; while (i < n && isDigit(src[i])) i++; }
      // The i < n guard matters: /[eE]/.test(undefined) is true ("undefined" contains an "e")
      if (i < n && /[eE]/.test(src[i])) {
        i++;
        if (i < n && /[+-]/.test(src[i])) i++;
        while (i < n && isDigit(src[i])) i++;
      }
      return src.slice(j, i);
    };
    const readQuoted = (q) => {
      let out = src[i++]; // opening quote
      while (i < n) {
        const c = src[i];
        out += c; i++;
        if (c === q) break;                               // closing quote
        if (c === '\\' && i < n) { out += src[i]; i++; }  // keep escaped char
      }
      return out;
    };

    while (i < n) {
      const ch = src[i], nx = src[i + 1];

      // 1) newline and whitespace must be preserved exactly
      if (ch === '\n') { pushTxt('\n'); i++; continue; }
      if (isWS(ch)) { pushTxt(readWhile(isWS)); continue; }

      // 2) preprocessor / directives
      if (ch === '#') {
        let j = i;
        while (j < n && src[j] !== '\n') j++;
        push('pre', src.slice(i, j)); i = j; continue;
      }

      // 3) comments (// and /* ... */)
      if (ch === '/' && nx === '/') {
        let j = i + 2;
        while (j < n && src[j] !== '\n') j++;
        push('com', src.slice(i, j)); i = j; continue;
      }
      if (ch === '/' && nx === '*') {
        let j = i + 2;
        while (j < n && !(src[j] === '*' && src[j + 1] === '/')) j++;
        j = Math.min(n, j + 2);
        push('com', src.slice(i, j)); i = j; continue;
      }

      // 4) strings / chars
      if (ch === '"' || ch === "'") { push('str', readQuoted(ch)); continue; }

      // 5) numbers
      if (isDigit(ch) || (ch === '.' && nx !== undefined && isDigit(nx))) {
        push('num', readNumber()); continue;
      }

      // 6) identifiers and function names
      if (/[A-Za-z_]/.test(ch)) {
        const j = i;
        while (i < n && /[A-Za-z0-9_]/.test(src[i])) i++;
        const id = src.slice(j, i);
        // Keyword/type lookup is case-insensitive here; compare `id` directly
        // if your language is strictly case-sensitive.
        const low = id.toLowerCase();
        if (KW.has(low)) { push('kw', id); continue; }
        if (TYPE.has(low)) { push('type', id); continue; }
        if (AFN.has(id)) { push('fn', id); continue; }
        // treat as function if immediately followed by '(' (after optional spaces)
        let k = i;
        while (k < n && /\s/.test(src[k])) k++;
        if (src[k] === '(') { push('fn', id); continue; }
        pushTxt(id); continue;
      }

      // 7) punctuation / operators
      if (',;(){}[].:+-/*%<>=!&|^~?'.includes(ch)) { push('pun', ch); i++; continue; }

      // 8) fallback (always advances i)
      pushTxt(ch); i++;
    }
    return tokens;
  },

  // 2.3 Auto-indent: brace-based with comment/string awareness
  autoIndent: (text, size = 2) => {
    const lines = text.replace(/\r\n?/g, '\n').split('\n');
    let depth = 0, inBlockCom = false, inStr = false, q = '';
    const out = [];
    for (const raw of lines) {
      const trimmed = raw.replace(/^[ \t]+/, '');
      let level = depth;
      if (trimmed.startsWith('}') || trimmed.startsWith(');')) {
        level = Math.max(0, level - 1);
      }
      // Blank lines stay empty instead of receiving trailing spaces
      out.push(trimmed === '' ? '' : ' '.repeat(level * size) + trimmed);

      // Scan the line to update depth for the following lines
      let i = 0;
      const n = raw.length;
      while (i < n) {
        const ch = raw[i], nx = raw[i + 1];
        if (inBlockCom) {
          if (ch === '*' && nx === '/') { inBlockCom = false; i += 2; continue; }
          i++; continue;
        }
        if (inStr) {
          if (ch === '\\') { i += 2; continue; }
          if (ch === q) { inStr = false; }
          i++; continue;
        }
        if (ch === '/' && nx === '/') break; // rest of the line is a comment
        if (ch === '/' && nx === '*') { inBlockCom = true; i += 2; continue; }
        if (ch === '"' || ch === "'") { inStr = true; q = ch; i++; continue; }
        if (ch === '{') { depth++; i++; continue; }
        if (ch === '}') { depth = Math.max(0, depth - 1); i++; continue; }
        i++;
      }
    }
    return out.join('\n');
  }
});

The three building blocks in detail

  • id (required)

    A short, unique string (e.g., "arduino", "toml", "kotlin"). Users can force your language by setting lang to this exact id.

  • detect(txt) → number

    Returns a confidence score between 0 and 1 (0 = not a match, 1 = certain). When the editor is set to lang: "auto", PWAxcode calls every registered detect and chooses the language with the highest score.

    Tips:
    • Match distinctive markers (keywords, directives, file-signature lines).
    • You can implement partial scoring to be more nuanced:
    detect: (txt) => {
      let score = 0;
      if (/void\s+setup\s*\(/.test(txt)) score += 0.5;
      if (/void\s+loop\s*\(/.test(txt)) score += 0.4;
      if (/#include\s*<Arduino\.h>/.test(txt)) score += 0.6;
      return Math.min(1, score);
    }

    If a user explicitly selects your language (api.setOption('lang', 'arduino')), detect is bypassed.
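
The "highest score wins" rule can be pictured with a small sketch. This is not PWAxcode's actual implementation: pickLanguage, the plugins array, the 'plain' fallback id, and the tie-breaking by registration order are all assumptions made for illustration.

```javascript
// Hypothetical sketch of auto-detection; not PWAxcode's real code.
function pickLanguage(plugins, txt, fallback = 'plain') {
  let best = { id: fallback, score: 0 };
  for (const p of plugins) {
    // Clamp to [0, 1] so a misbehaving detect cannot dominate.
    const raw = Number(p.detect(txt));
    const score = Number.isFinite(raw) ? Math.min(1, Math.max(0, raw)) : 0;
    // Strictly greater: on a tie, the earlier plugin keeps the win (assumption).
    if (score > best.score) best = { id: p.id, score };
  }
  return best.id;
}

// Two illustrative plugins with only the fields the sketch needs.
const plugins = [
  { id: 'arduino', detect: (t) => (/void\s+loop\s*\(/.test(t) ? 0.9 : 0) },
  { id: 'toml', detect: (t) => (/^\[[\w.]+\]\s*$/m.test(t) ? 0.7 : 0) },
];

console.log(pickLanguage(plugins, 'void loop() {}'));
console.log(pickLanguage(plugins, '[package]\nname = "x"'));
console.log(pickLanguage(plugins, 'hello'));
```

A practical consequence: a detect that returns 1 on a weak signal will beat every more careful plugin, so reserve high scores for markers that are truly distinctive.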

  • tokenize(src) → Token[]

    Receives the raw source and must return an array of tokens. Each token is an object with:

    • text (string): the original text for that token (must be preserved exactly).
    • type (optional string): a style hint used by the highlighter.
    Important guidelines:
    • Never drop characters. The concatenation of all text fields must exactly equal the original input.
    • Preserve newlines as their own tokens (e.g., "\n"), so line numbers and overlays stay accurate.
    • Keep it linear (O(n)). The Arduino example uses a pointer i and tiny helpers to read whitespace, numbers, and quoted strings efficiently.
    • Token type names are up to you; typical sets include:
      • kw (keyword), type, fn (function), str (string), num (number), com (comment),
        pre (preprocessor/directive), pun (punctuation/operator).
    • If you have extra categories (e.g., annotation, macro, const), add matching CSS rules in your theme.
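
One way to check the "never drop characters" rule during development is a small round-trip helper. The name checkTokens and the shape of its result are hypothetical and not part of PWAxcode. The newline check is limited to untyped tokens, on the assumption that typed tokens such as block comments may legitimately span lines.

```javascript
// Hypothetical development helper; returns a list of problems (empty means OK).
function checkTokens(src, tokens) {
  const problems = [];
  // Rule 1: concatenating every token's text must reproduce the source exactly.
  const joined = tokens.map((t) => t.text).join('');
  if (joined !== src) problems.push('concatenated text differs from source');
  tokens.forEach((t, idx) => {
    if (typeof t.text !== 'string' || t.text.length === 0) {
      problems.push(`token ${idx} has empty or non-string text`);
    } else if (t.type === undefined && t.text.includes('\n') && t.text !== '\n') {
      // Rule 2: plain tokens should not swallow newlines together with other text.
      problems.push(`token ${idx} mixes a newline with other text`);
    }
  });
  return problems;
}

const good = [{ type: 'kw', text: 'if' }, { text: ' ' }, { text: 'x' }, { text: '\n' }];
const bad = [{ type: 'kw', text: 'if' }, { text: ' \n' }];
console.log(checkTokens('if x\n', good));
console.log(checkTokens('if x\n', bad));
```

Running this against the output of your own tokenize on a handful of sample files catches dropped or duplicated characters before they show up as misaligned lines.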

    Whitespace & newlines
    The example groups runs of whitespace into a single plain token and emits bare "\n" tokens. This keeps the view fast and accurate.


    Comments & strings

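    The tokenizer must not treat comment markers inside strings as comments, or quotes inside comments as strings. The Arduino example handles comments before strings and uses readQuoted(), which skips escaped characters (\", \') so they do not end the string early.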


    Identifiers

    The example recognizes identifiers (/[A-Za-z_][A-Za-z0-9_]*/), checks them against:

    • KW (keywords),
    • TYPE (types),
    • AFN (Arduino functions),
      and finally looks ahead for a ( to guess function calls.


    Numbers

    It accepts integers, decimals, and exponents. Extend it if you need hex (0x), binary (0b), numeric separators, etc.
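
As a sketch of such an extension, the helper below recognizes hex, binary, C++14-style ' digit separators, and a loose set of suffixes. The name scanNumber and the exact grammar are assumptions, not part of the Arduino example; it uses a sticky regular expression so matching starts at a given index without slicing the source.

```javascript
// Hypothetical extended number scanner (illustrative grammar, not a full C++ lexer).
// Alternatives: hex | binary | decimal with optional fraction and exponent,
// followed by any run of u/U/l/L/f/F suffix letters.
const NUMBER = new RegExp(
  "(?:0[xX][0-9a-fA-F]+(?:'[0-9a-fA-F]+)*" +
  "|0[bB][01]+(?:'[01]+)*" +
  "|(?:\\d+(?:'\\d+)*(?:\\.\\d*)?|\\.\\d+)(?:[eE][+-]?\\d+)?)" +
  "[uUlLfF]*",
  'y' // sticky: match only at lastIndex
);

// Returns the index just past the literal that starts at i, or i if none starts there.
function scanNumber(src, i) {
  NUMBER.lastIndex = i;
  const m = NUMBER.exec(src);
  return m ? i + m[0].length : i;
}

const sample = "x = 0x1F;";
console.log(sample.slice(4, scanNumber(sample, 4))); // prints 0x1F
console.log(scanNumber("1'000'000UL", 0));           // prints 11
console.log(scanNumber("1e", 0));                    // prints 1 (a bare "e" is not an exponent)
```

Inside tokenize, this would replace readNumber: call it at the current index, emit src.slice(i, end) as a num token, and set i = end.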


    Punctuation

    Single-char fallback for operators and delimiters. For multi-char operators (==, !=, &&, ||, <<, >>, ::) you can read two chars at once and emit a single token.
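
A minimal sketch of multi-character operator matching follows. The operator list is illustrative and the name scanOperator is hypothetical; the key design point is trying longer operators first so that >>= is not split into >> and =.

```javascript
// Illustrative operator table; sorted so the longest candidates are tried first.
const OPERATORS = ['<<=', '>>=', '->', '::', '==', '!=', '<=', '>=', '&&', '||',
  '<<', '>>', '++', '--', '+=', '-=', '*=', '/=']
  .sort((a, b) => b.length - a.length);
const PUNCT = ',;(){}[].:+-/*%<>=!&|^~?';

// Returns the operator or punctuation text starting at index i, or '' if none.
function scanOperator(src, i) {
  if (i >= src.length) return '';
  for (const op of OPERATORS) {
    if (src.startsWith(op, i)) return op;
  }
  return PUNCT.includes(src[i]) ? src[i] : '';
}

console.log(scanOperator('a >>= 2', 2)); // prints >>=
console.log(scanOperator('a == b', 2));  // prints ==
console.log(scanOperator('a = b', 2));   // prints =
```

In the tokenizer, branch 7 would then push the returned text as one pun token and advance i by its length.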


  • autoIndent(text, size=2) → string

    Given a full text, returns a re-indented version. This is a best-effort formatter—lightweight and predictable.

    The Arduino version:
    • Normalizes line endings to \n.
    • Tracks structure with a small state machine that is aware of comments and strings, so braces inside comments/strings are ignored.
    • Dedents any line that starts with a closing brace (}) or, as a special case, with );. It then adjusts the depth for subsequent lines as it encounters { and }.
    • Uses the size parameter to define the indent step (spaces per level).

    Customize this to your language: for example, increase/decrease indent after case/default, align else with matching if, or handle begin / end pairs.
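
For a language with begin / end blocks, the same idea might look like the sketch below. It is deliberately simplified: indentBeginEnd is a hypothetical name, it only reacts to a line that ends with begin or starts with end, and unlike the Arduino version it does not skip comments or strings.

```javascript
// Hypothetical begin/end indenter (simplified; ignores comments and strings).
function indentBeginEnd(text, size = 2) {
  const lines = text.replace(/\r\n?/g, '\n').split('\n');
  let depth = 0;
  const out = [];
  for (const raw of lines) {
    const trimmed = raw.trim();
    // A line that starts with the word `end` closes a block before it is printed.
    if (/^end\b/.test(trimmed)) depth = Math.max(0, depth - 1);
    out.push(trimmed === '' ? '' : ' '.repeat(depth * size) + trimmed);
    // A line that ends with the word `begin` opens a block for the lines after it.
    if (/\bbegin$/.test(trimmed)) depth++;
  }
  return out.join('\n');
}

const source = 'begin\nx := 1;\nif y then begin\nz := 2;\nend;\nend';
console.log(indentBeginEnd(source));
```

The word-boundary checks keep identifiers such as endpoint from being mistaken for end. A production version would reuse the comment and string state machine from the Arduino example.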

Styling your tokens (themes)

The highlighter adds a CSS class for each token type (e.g., .kw, .type, .fn). To make your language look great, ensure your theme (or site CSS) styles the classes you emit. For example:

/* Example theme hooks */
.pxc .hl .kw   { font-weight: 600; }
.pxc .hl .type { color: var(--pxc-accent-7); }
.pxc .hl .fn   { color: var(--pxc-accent-6); }
.pxc .hl .str  { color: var(--pxc-green-6); }
.pxc .hl .num  { color: var(--pxc-orange-6); }
.pxc .hl .com  { color: var(--pxc-gray-6); font-style: italic; }
.pxc .hl .pre  { color: var(--pxc-purple-6); }


If you introduce new token types (e.g., .macro), add rules for them.

Enabling and testing your language

  • Register it (once, after PWAxcode is on the page).
  • Pick a mode:
    • Auto: api.setOption('lang', 'auto') (lets your detect run).
    • Manual: api.setOption('lang', 'arduino') (force your ID).
  • Turn highlighting on if needed (api.toggleHighlight() or via the UI).
  • Debug tokens: enable the token overlay if available (e.g., api.showTokens(true)) to visually inspect categories.
  • Try edge cases:
    • Empty files, only comments, only strings.
    • Long lines, mixed line endings (CRLF / LF).
    • Deeply nested braces.
    • Code samples that contain // or /* ... */ inside strings.
  • Auto-indent a selection or the whole text and confirm:
    • Lines starting with } dedent.
    • Braces inside strings/comments don’t shift depth.
    • Your chosen indent width (2/4 spaces) is applied consistently.
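
Some of these checks can be automated. The sketch below tests two properties of any autoIndent function: running it twice gives the same result as running it once, and it never changes non-whitespace content. checkIndenter and the stand-in braceIndent are hypothetical names; braceIndent exists only so the sketch runs on its own and is much simpler than the Arduino version.

```javascript
// Hypothetical property checks for an autoIndent implementation.
function checkIndenter(autoIndent, samples, size = 2) {
  const failures = [];
  const strip = (s) => s.replace(/\s+/g, '');
  for (const src of samples) {
    const once = autoIndent(src, size);
    const twice = autoIndent(once, size);
    if (once !== twice) failures.push({ src, reason: 'not idempotent' });
    // Indentation may change, but non-whitespace content must not.
    if (strip(once) !== strip(src)) failures.push({ src, reason: 'content changed' });
  }
  return failures;
}

// Minimal stand-in indenter (braces only, no comment or string awareness).
const braceIndent = (text, size = 2) => {
  let depth = 0;
  return text.replace(/\r\n?/g, '\n').split('\n').map((raw) => {
    const t = raw.trim();
    const level = t.startsWith('}') ? Math.max(0, depth - 1) : depth;
    for (const ch of t) {
      if (ch === '{') depth++;
      else if (ch === '}') depth = Math.max(0, depth - 1);
    }
    return t === '' ? '' : ' '.repeat(level * size) + t;
  }).join('\n');
};

const samples = ['', 'void f() {\r\nx();\r\n}', '{{{\na\n}}}'];
console.log(checkIndenter(braceIndent, samples));          // prints []
console.log(checkIndenter((t) => '  ' + t, samples).length); // prints 3
```

Pass your plugin's own autoIndent in place of braceIndent, and extend samples with the edge cases listed above.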

Common pitfalls and how to avoid them

  • Dropped characters or drifting lines
    Every character must be emitted exactly once. Preserve newlines as "\n" tokens. If line overlays look misaligned, check your whitespace/newline handling first.
  • Infinite loops
    In tokenize, ensure the index i always advances—even in “fallback” branches.
  • Over-greedy patterns
    Avoid super-broad regexes that scan the whole text repeatedly. Prefer a single pass with small helpers, as in the example.
  • Function detection false positives
    If your language has identifiers followed by ( that aren’t functions (e.g., macro(arg)), add dedicated sets (e.g., MACRO) or tighter checks.
  • Indentation vs. comments/strings
    Your autoIndent must ignore braces inside comments and strings. Keep the small state machine (line vs. block comments, quote tracking, escapes).
  • Custom types without CSS
    If you emit token types that your theme doesn’t style, they’ll render like plain text. Add minimal CSS hooks.
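
To illustrate the function-detection pitfall, here is a sketch that consults a dedicated set before falling back to the "followed by (" heuristic. The MACRO set contents, the macro token type, and the name classifyCall are assumptions for illustration; a macro type would also need its own CSS rule.

```javascript
// Illustrative set of macro names that look like calls but are not functions.
const MACRO = new Set(['F', 'PSTR', '_BV']);

// `end` is the index just past the identifier `id` within `src`.
// Returns 'macro', 'fn', or undefined (plain identifier).
function classifyCall(id, src, end) {
  if (MACRO.has(id)) return 'macro';
  let k = end;
  while (k < src.length && /[ \t]/.test(src[k])) k++; // skip spaces before '('
  return src[k] === '(' ? 'fn' : undefined;
}

const line = 'Serial.println(F("hi"));';
console.log(classifyCall('F', line, 16));       // prints macro
console.log(classifyCall('println', line, 14)); // prints fn
console.log(classifyCall('Serial', line, 6));   // prints undefined
```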

Adapting the template to other languages

  • Replace the keyword/type/function sets with your language’s vocabulary.
  • Extend numbers (hex, binary, suffixes like L, UL, f).
  • Add operators (e.g., ===, =>, ??, ?.) as multi-char tokens.
  • Enhance detect to look for distinctive constructs (shebangs, package lines, file headers).
  • Tune autoIndent to the structure of your language (e.g., begin / end, case blocks, significant indentation).

Quick checklist

  • Unique id chosen and registered.
  • detect returns a sensible 0..1 score.
  • tokenize is single-pass, preserves all text, emits newlines.
  • Token categories match your theme’s CSS.
  • autoIndent ignores braces inside comments/strings.
  • Large files remain smooth (no quadratic regexes).
  • Tested with highlight on/off, wrap on/off, different fonts and tab sizes.

With this structure in place, you can drop in the Arduino example, rename the sets, and evolve it into a robust highlighter for any language you need.

Arduino code demo


// Assumed pin: the board's built-in LED
const int LED = LED_BUILTIN;

void setup() {
  pinMode(LED, OUTPUT);
}

void loop() {
  digitalWrite(LED, HIGH);
  delay(500);
  digitalWrite(LED, LOW);
  delay(500);
}