TMCRA TokenGraph-LLM Stage C

TMCRA TokenGraph-LLM is an experimental graph-native autoregressive language model prototype. It is not a Transformer wrapper and does not call an external LLM at inference time. Text is generated from token-level graph encoding, learned edge gates, graph message passing, and a dynamic graph causal decoder.

This Hugging Face repository hosts model artifacts. Full source code, training scripts, graph builders, and documentation are published in the GitHub repository:

https://github.com/reshuibuduo/TMCRA-TokenGraph-LLM

Current Model

Current default checkpoint:

release line: v0.2.0-stagec
package: tmcra_tokengraph_stagec_model_package_20260606.zip
checkpoint inside package: checkpoint/token_graph_dynamic_decoder_v3.pt
parameters: 114,615,372
shape: dim=512, graph_layers=8, decoder_layers=10
embeddings: untied
precision during training: bf16
effective training samples: about 1.03M
training steps: 62,000
SHA256: cc23285628eaed47c20009b6be6b5eb0600ded57ac2e09519370d97158fecd33
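
A downloaded artifact can be checked against the digest above before use. The following is a minimal sketch using only the Python standard library. The listing does not say whether the digest covers the zip or the checkpoint file inside it, so the sketch assumes the zip; the local file path is also an assumption.

```python
import hashlib
from pathlib import Path

# Digest copied from the "Current default checkpoint" listing above.
EXPECTED_SHA256 = "cc23285628eaed47c20009b6be6b5eb0600ded57ac2e09519370d97158fecd33"


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large artifacts are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumption: the zip was downloaded into the current directory.
    artifact = Path("tmcra_tokengraph_stagec_model_package_20260606.zip")
    if artifact.exists():
        print("match" if matches_expected(artifact) else "MISMATCH")
    else:
        print(f"{artifact} not found")
```

If the zip does not match, the same helper can be pointed at the extracted checkpoint/token_graph_dynamic_decoder_v3.pt to see whether the digest refers to that file instead.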

Legacy v0.1 files may still be present in this repository for historical comparison. The Stage C package is the current recommended artifact.

Package Contents

The Stage C zip contains:

checkpoint/token_graph_dynamic_decoder_v3.pt
dataset_metadata/tokenizer.json
dataset_metadata/manifest.json
training_summary_stagec_public.json
docs/TMCRA_TOKENGRAPH_STAGEC_TECHNICAL_OVERVIEW.md
docs/TMCRA_TOKENGRAPH_STAGEC_TECHNICAL_OVERVIEW_ZH.md
docs/STAGEC_DETAILED_BENCHMARK_SMOKE_20260606.md
MODEL_CARD.md
PACKAGE_MANIFEST.md
SHA256SUMS.txt
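
To confirm that a downloaded zip contains the files listed above, a short standard-library check is enough. This sketch assumes the entries sit at the archive root (it also accepts a single wrapping folder); the function name is illustrative and not part of the repository.

```python
import zipfile

# Member paths copied from the package listing above.
EXPECTED_MEMBERS = [
    "checkpoint/token_graph_dynamic_decoder_v3.pt",
    "dataset_metadata/tokenizer.json",
    "dataset_metadata/manifest.json",
    "training_summary_stagec_public.json",
    "docs/TMCRA_TOKENGRAPH_STAGEC_TECHNICAL_OVERVIEW.md",
    "docs/TMCRA_TOKENGRAPH_STAGEC_TECHNICAL_OVERVIEW_ZH.md",
    "docs/STAGEC_DETAILED_BENCHMARK_SMOKE_20260606.md",
    "MODEL_CARD.md",
    "PACKAGE_MANIFEST.md",
    "SHA256SUMS.txt",
]


def missing_members(zip_source, expected=EXPECTED_MEMBERS):
    """Return the expected entries that are absent from the archive.

    zip_source may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        names = set(archive.namelist())
    # Assumption: if the zip wraps everything in a top-level folder,
    # a suffix match is good enough for a sanity check.
    return [
        member
        for member in expected
        if member not in names
        and not any(name.endswith("/" + member) for name in names)
    ]
```

An empty list from missing_members("tmcra_tokengraph_stagec_model_package_20260606.zip") means every listed file was found.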

Full-Chain Training Code

The GitHub source repository now includes the full-chain Stage C training path:

open-corpus schema2 conversion scripts;
optional semantic teacher annotation through OpenAI-compatible or local Hugging Face models;
token-level reasoning graph builders;
simple_plus_causal_target graph mode;
Stage C training and checkpoint continuation;
graph ablation and token attribution evaluation.

Start from:

docs/FULL_CHAIN_TRAINING.md
docs/FULL_CHAIN_TRAINING_ZH.md
scripts/run_stagec_full_chain_template.sh
scripts/run_stagec_sharded_training_template.sh

How Next-Token Generation Works

Stage C predicts the next token through a graph-native causal path:

flowchart TD
    A["Text / prompt / source segments / target text"] --> B["Tokenizer"]
    B --> C["Token Graph Builder"]
    C --> D["Token nodes"]
    C --> E["Typed candidate edges"]
    D --> G["TokenGraphEncoderV3"]
    E --> G
    G --> H["Encoded context graph states"]
    H --> I["Dynamic Token Graph Decoder"]
    I --> J["Generated token node"]
    J --> I
    I --> K["Next-token distribution"]

schema2 text
  -> token graph nodes and typed candidate edges
  -> learned edge-gated graph propagation
  -> dynamic generated-token graph nodes
  -> prefix-edge + context-edge gated decoding
  -> vocabulary logits

The graph builder proposes token nodes and typed candidate edges. The model then learns edge gates, propagates messages through the token graph, scores context nodes, and decodes each generated token as a dynamic graph node. The decoder combines a learned prefix message from previous generated-token nodes with a learned context message from encoded graph nodes, then maps the updated graph-decoder state to next-token logits. This keeps the main objective as next-token prediction while making generation depend on typed graph structure rather than Transformer self-attention.
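
The encoder side of this description (typed candidate edges, learned edge gates, message propagation) can be illustrated with a toy NumPy layer. This is a sketch under stated assumptions, not the TokenGraphEncoderV3 code: the gate form (sigmoid over both endpoint states plus a per-type bias), the residual tanh update, and every parameter name are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def edge_gated_layer(h, edges, edge_types, w_msg, gate_w, type_bias):
    """One round of edge-gated message passing (illustrative only).

    h:          (N, d) node states
    edges:      (E, 2) integer array of (source, destination) node indices
    edge_types: (E,)   integer edge-type ids
    """
    src, dst = edges[:, 0], edges[:, 1]
    # One scalar gate per edge, from both endpoint states plus a per-type bias.
    gate_in = np.concatenate([h[src], h[dst]], axis=1) @ gate_w + type_bias[edge_types]
    gate = sigmoid(gate_in)                      # shape (E,)
    # Transform each node once, then send gated messages along the edges.
    transformed = h @ w_msg                      # shape (N, d)
    messages = gate[:, None] * transformed[src]  # shape (E, d)
    aggregated = np.zeros_like(h)
    np.add.at(aggregated, dst, messages)         # scatter-add; safe for repeated destinations
    return h + np.tanh(aggregated)               # residual update


rng = np.random.default_rng(0)
num_nodes, dim, num_types = 4, 8, 3              # toy sizes, not the released model's
h = rng.normal(size=(num_nodes, dim))
edges = np.array([[0, 1], [2, 1], [1, 3]])       # hypothetical candidate edges
edge_types = np.array([0, 2, 1])
w_msg = rng.normal(size=(dim, dim)) / np.sqrt(dim)
gate_w = rng.normal(size=2 * dim) / np.sqrt(2 * dim)
type_bias = np.zeros(num_types)

updated = edge_gated_layer(h, edges, edge_types, w_msg, gate_w, type_bias)
print(updated.shape)
```

Nodes with no incoming edge keep their state unchanged in this sketch, which is one simple way to see that the output depends on the edge set.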

Single-step decoding:

flowchart LR
    A["Encoded context graph<br/>N nodes"] --> D["Context gate"]
    B["Generated prefix nodes<br/>window W"] --> C["Prefix gate"]
    C --> E["Generated-token node t"]
    D --> E
    E --> F["Graph decoder update"]
    F --> G["Vocabulary logits"]
    G --> H["next token"]
    H --> I["Append as graph node"]
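
The single-step diagram above can be mirrored in a toy NumPy loop. This is a hedged sketch, not the released decoder: the gate functions, the mean aggregation, the way a new node state is formed, and all parameter names are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def decode_step(context, prefix, params, window):
    """One decoding step (illustrative only).

    context: (N, d) encoded context node states, N >= 1
    prefix:  (t, d) states of already generated token nodes, t >= 1
    window:  bounded prefix window W, W >= 1
    Returns (vocabulary logits, updated decoder state).
    """
    query = prefix[-1]          # the newest generated-token node drives the step
    recent = prefix[-window:]   # only the last W generated nodes are visible

    # Prefix gate: one scalar in (0, 1) per recent generated node.
    prefix_gate = sigmoid(recent @ params["w_prefix"] @ query)
    prefix_msg = (prefix_gate[:, None] * recent).mean(axis=0)

    # Context gate: one scalar per encoded context node (scans all N nodes).
    context_gate = sigmoid(context @ params["w_context"] @ query)
    context_msg = (context_gate[:, None] * context).mean(axis=0)

    # Graph decoder update, then projection to vocabulary logits.
    state = np.tanh((query + prefix_msg + context_msg) @ params["w_update"])
    return state @ params["w_out"], state


def greedy_generate(context, start_state, params, embed, window, steps):
    """Greedy loop: pick the argmax token and append it as a new graph node."""
    prefix = start_state[None, :]
    tokens = []
    for _ in range(steps):
        logits, state = decode_step(context, prefix, params, window)
        token = int(np.argmax(logits))
        tokens.append(token)
        # Assumption: the new node state is the token embedding plus the decoder state.
        prefix = np.vstack([prefix, embed[token] + state])
    return tokens


rng = np.random.default_rng(0)
N, d, V = 6, 8, 20              # toy sizes, not the released model's
params = {
    "w_prefix": rng.normal(size=(d, d)) / np.sqrt(d),
    "w_context": rng.normal(size=(d, d)) / np.sqrt(d),
    "w_update": rng.normal(size=(d, d)) / np.sqrt(d),
    "w_out": rng.normal(size=(d, V)) / np.sqrt(d),
}
context = rng.normal(size=(N, d))
embed = rng.normal(size=(V, d))
start = rng.normal(size=d)      # hypothetical start-of-sequence node state

print(greedy_generate(context, start, params, embed, window=3, steps=5))
```

Each step touches W prefix nodes and all N context nodes, which is the per-step behavior the diagram describes for the prefix gate and the context gate.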

Complexity Growth

Per layer, dense Transformer self-attention over a sequence of n tokens grows roughly as:

O(n^2 * d)

Stage C replaces sequence-wide all-token attention with graph candidate edges and dynamic graph decoding:

Graph encoder:        O(L_g * (N + E) * d)
Dynamic prefix path:  O(L_d * T * W * d)
Context tunnel path:  O(L_d * T * N * d)

where N is the number of context graph nodes, E is the number of candidate edges, T is the generated length, W is the bounded generated-prefix window, d is the hidden dimension, and L_g and L_d are the numbers of graph-encoder and decoder layers. The context tunnel still scans all encoded context nodes at every decoding step, so the accurate claim is not constant-time generation. The claim is that Stage C replaces dense sequence-wide self-attention with sparse typed graph propagation plus explicit context tunneling.
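
To get a feel for how the three terms compare, the expressions can be evaluated directly. In the sketch below, d, graph_layers, and decoder_layers come from the checkpoint listing (dim=512, graph_layers=8, decoder_layers=10); the values of N, E, T, and W are hypothetical. The results are leading-order terms, not measured FLOPs.

```python
def dense_attention_cost(n, d, layers=1):
    """Leading term of dense self-attention: layers * n^2 * d."""
    return layers * n * n * d


def stage_c_cost(N, E, T, W, d, graph_layers, decoder_layers):
    """Leading terms of the three Stage C paths, as written above."""
    return {
        "graph_encoder": graph_layers * (N + E) * d,
        "prefix_path": decoder_layers * T * W * d,
        "context_tunnel": decoder_layers * T * N * d,
    }


if __name__ == "__main__":
    # d and layer counts from the checkpoint listing; N, E, T, W are made-up examples.
    terms = stage_c_cost(N=512, E=4096, T=128, W=32, d=512,
                         graph_layers=8, decoder_layers=10)
    for name, value in terms.items():
        print(f"{name:>15}: {value:,}")
    print(f"{'total':>15}: {sum(terms.values()):,}")
```

With a fixed window W, the prefix path grows linearly in T, while the context tunnel grows with the product T * N, which is why it dominates for large context graphs.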

Current Capability Boundary

Stage C can generate early English story-style continuations and shows measurable dependence on typed graph edges. It is not a reliable production LLM. Current weak areas include exact factual QA, numeric reasoning, robust instruction following, grammar stability, multilingual generation, and long-range concept binding.

Smoke results:

evaluation	result
Stage C normal total loss	6.512117
Stage C normal LM loss	4.641285
no_edges total loss	8.310654
shuffle_edges total loss	7.702783
TinyStories avg words	73.88
BLiMP smoke	59%-64%

These are smoke tests, not leaderboard claims.
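
The two ablation rows are easiest to read as increases over the normal run. The short calculation below uses only the total-loss numbers from the table; the helper name is illustrative.

```python
# Total-loss values copied from the smoke results above.
NORMAL_TOTAL_LOSS = 6.512117
ABLATIONS = {"no_edges": 8.310654, "shuffle_edges": 7.702783}


def loss_increase(baseline, ablated):
    """Return (absolute increase, percent increase) of an ablated loss over the baseline."""
    delta = ablated - baseline
    return delta, 100.0 * delta / baseline


for name, loss in ABLATIONS.items():
    delta, percent = loss_increase(NORMAL_TOTAL_LOSS, loss)
    print(f"{name}: +{delta:.6f} total loss ({percent:.1f}% above normal)")
```

Removing edges raises total loss by about 1.80 (roughly 27.6%), and shuffling them raises it by about 1.19 (roughly 18.3%), which is the measurable dependence on typed graph edges referred to above.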

License

MIT.
