AIが推論を行っている時に、中で何が起きているのかを、ローカルLLMを使って可視化してみた

Last updated at 2026-05-19Posted at 2026-05-19

今回、新しくAIの案件にアサインされることになり、AIアプリ自体は開発したことはあるものの、実際あまり時間が無くて内部構造は触れずに来ていた。
内部構造といっても、LLMが一般に公開された初期に大前提として「AIというのは、結局のところ特定の言葉から次の言葉を確率モデルに従って予測しているだけである」ということは知っていた。
ただ、LLMがこれだけ進化して、推論も相当精度が上がったところで、中で何が変わっているのかは知らなかったので、その勉強用にわざわざスクリプトを組んで実験を行ったということである。

実験用スクリプトの概要

https://github.com/wara714/local-llm-runtime
もしソースコードを見られたい方は、こちらのリポジトリをcloneいただければと思う。python3.11以上で動作する。下記に示すモデルを使用すれば、CPUのマシンにて問題なく動作する。

構成としては、ローカルLLM + 推論機構 + APIのみのシンプルな構成とした。ローカルLLMはHuggingfaceのQwen/Qwen2.5-0.5B-Instructを使用した。今回はLLMがどうやって推論をするのか？を調べただけなので、さほど思いモデルは採用しなかった。

そして、ローカルLLMを中核に置きつつ、OpenAIのResponsiveAPIと同一の構造のAPIを模擬的に実装した。これにより、外からAPIを叩いたときに、いったい中で何が起こっているのか？を全て見ることができる。

何を実験したか

今回知りたかったのは、「LLMに推論なるものをさせたときに、中で何が起きるのか」である。そこで今回は荒業であるが、全関数の全IN, OUTを都度標準出力にoutputするようにしている。これで全てのデータの経路が可視化される。

※loggerで出そうと思ったが設定が思ったより面倒だったので、printでやってしまった。一応後から消せるようになっているので、ご容赦いただきたい。

結果

ログをそのまま張ろうと思う。やたら長いので閲覧注意である
※AIの回答が一部入るが、ChatGPTなどソースが公開されていないモデルに対しての言及に関しては、推測の域を出ないことをご了承いただきたい

[DEBUG-TRACE] enable_full_tracing | ENABLING FULL DEBUG TRACING for local_llm_runtime
14:05:38 [TRACE] ============================================================
14:05:38 [TRACE] ENABLING FULL DEBUG TRACING for local_llm_runtime
14:05:38 [TRACE] ============================================================
[DEBUG-TRACE] instrument_module | module=local_llm_runtime.api
[DEBUG-TRACE] enable_full_tracing | Instrumented: local_llm_runtime.api
14:05:38 [TRACE] Instrumented: local_llm_runtime.api
[DEBUG-TRACE] → IN  | func=main | file=api/server.py:17 | args=(sys.argv=['-m', '--port', '8000'])
[DEBUG-TRACE]   ↔   | func=main | file=api/server.py:17 | starting uvicorn on 0.0.0.0:8000
[DEBUG-TRACE] → IN  | func=create_app | file=api/app.py:24 | args=(title='Local LLM Runtime — OpenAI Responses API Compatible', version='0.1.0')
[DEBUG-TRACE] ← OUT | func=create_app | file=api/app.py:24 | return=<FastAPI app>
INFO:     Started server process [43277]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
[DEBUG-TRACE] → IN  | func=get_runtime | file=api/dependencies.py:65 | args=(existing=False)
[DEBUG-TRACE] ← OUT | func=get_runtime | file=api/dependencies.py:65 | return=RuntimeState(model=None, loaded=False)
[DEBUG-TRACE] → IN  | func=create_response | file=api/routes.py:115 | args=(req.model='local', req.input='Hello', req.stream=False)
[DEBUG-TRACE] → IN  | func=RuntimeState.ensure_loaded | file=api/dependencies.py:54 | args=(model_name=None, is_loaded=False)
[DEBUG-TRACE] → IN  | func=RuntimeState.load_model | file=api/dependencies.py:35 | args=(model_name='Qwen/Qwen2.5-0.5B-Instruct')
2026-05-16 14:05:43,471 INFO local_llm_runtime.api.dependencies: Loading model: Qwen/Qwen2.5-0.5B-Instruct
[DEBUG-TRACE] → IN  | func=TokenizerAdapter.__init__ | file=model/tokenizer_adapter.py:12 | args=(model_name='Qwen/Qwen2.5-0.5B-Instruct')
2026-05-16 14:05:46,155 INFO httpx: HTTP Request: HEAD https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct/resolve/main/config.json "HTTP/1.1 307 Temporary Redirect"
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
2026-05-16 14:05:46,155 WARNING huggingface_hub.utils._http: Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
2026-05-16 14:05:46,163 INFO httpx: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/Qwen/Qwen2.5-0.5B-Instruct/7ae557604adf67be50417f59c2c2f167def9a775/config.json "HTTP/1.1 200 OK"
2026-05-16 14:05:46,339 INFO httpx: HTTP Request: HEAD https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct/resolve/main/tokenizer_config.json "HTTP/1.1 307 Temporary Redirect"
2026-05-16 14:05:46,346 INFO httpx: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/Qwen/Qwen2.5-0.5B-Instruct/7ae557604adf67be50417f59c2c2f167def9a775/tokenizer_config.json "HTTP/1.1 200 OK"
2026-05-16 14:05:46,515 INFO httpx: HTTP Request: GET https://huggingface.co/api/models/Qwen/Qwen2.5-0.5B-Instruct/tree/main/additional_chat_templates?recursive=false&expand=false "HTTP/1.1 404 Not Found"
2026-05-16 14:05:46,681 INFO httpx: HTTP Request: GET https://huggingface.co/api/models/Qwen/Qwen2.5-0.5B-Instruct/tree/main?recursive=true&expand=false "HTTP/1.1 200 OK"
2026-05-16 14:05:47,335 INFO httpx: HTTP Request: GET https://huggingface.co/api/models/Qwen/Qwen2.5-0.5B-Instruct "HTTP/1.1 200 OK"
[DEBUG-TRACE] ← OUT | func=TokenizerAdapter.__init__ | file=model/tokenizer_adapter.py:12 | return=vocab_size=151665, eos_token_id=151645
[DEBUG-TRACE] → IN  | func=HFModelAdapter.__init__ | file=model/hf_model_adapter.py:17 | args=(model_name='Qwen/Qwen2.5-0.5B-Instruct', device='auto')
2026-05-16 14:05:47,563 INFO httpx: HTTP Request: HEAD https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct/resolve/main/config.json "HTTP/1.1 307 Temporary Redirect"
2026-05-16 14:05:47,571 INFO httpx: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/Qwen/Qwen2.5-0.5B-Instruct/7ae557604adf67be50417f59c2c2f167def9a775/config.json "HTTP/1.1 200 OK"
[transformers] `torch_dtype` is deprecated! Use `dtype` instead!
Loading weights: 100%|███████████████████████████████████████| 290/290 [00:00<00:00, 2207.90it/s]
2026-05-16 14:05:50,317 INFO httpx: HTTP Request: HEAD https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct/resolve/main/generation_config.json "HTTP/1.1 307 Temporary Redirect"
2026-05-16 14:05:50,324 INFO httpx: HTTP Request: HEAD https://huggingface.co/api/resolve-cache/models/Qwen/Qwen2.5-0.5B-Instruct/7ae557604adf67be50417f59c2c2f167def9a775/generation_config.json "HTTP/1.1 200 OK"
2026-05-16 14:05:50,508 INFO httpx: HTTP Request: HEAD https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct/resolve/main/custom_generate/generate.py "HTTP/1.1 404 Not Found"
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.__init__ | file=model/hf_model_adapter.py:17 | return=model loaded on device=cpu
[DEBUG-TRACE] → IN  | func=Decoder.__init__ | file=inference/decoder.py:32 | args=(model_adapter=HFModelAdapter, tokenizer_adapter=TokenizerAdapter)
[DEBUG-TRACE] ← OUT | func=Decoder.__init__ | file=inference/decoder.py:32 | return=None
2026-05-16 14:05:50,511 INFO local_llm_runtime.api.dependencies: Model loaded: Qwen/Qwen2.5-0.5B-Instruct
[DEBUG-TRACE] ← OUT | func=RuntimeState.load_model | file=api/dependencies.py:35 | return=model loaded: 'Qwen/Qwen2.5-0.5B-Instruct'
[DEBUG-TRACE] ← OUT | func=RuntimeState.ensure_loaded | file=api/dependencies.py:54 | return=model ready: 'Qwen/Qwen2.5-0.5B-Instruct'
[DEBUG-TRACE] → IN  | func=_build_generation_config | file=api/routes.py:52 | args=(req.model='local', req.temperature=None, req.max_output_tokens=None)
[DEBUG-TRACE]   ↔   | func=GenerationConfig.normalize_sampling | file=inference/generation_config.py:20 | args=(temperature=0.0, do_sample=False, decoding_strategy=greedy)
[DEBUG-TRACE]   ↔   | func=GenerationConfig.normalize_sampling | file=inference/generation_config.py:20 | return=(do_sample=False, decoding_strategy=greedy)
[DEBUG-TRACE] ← OUT | func=_build_generation_config | file=api/routes.py:52 | return=GenerationConfig(max_new_tokens=64, temperature=0.0, top_k=None, top_p=None, repetition_penalty=None, do_sample=False, stop=[], eos_token_id=None, pad_token_id=None, seed=None, decoding_strategy='greedy', min_new_tokens=0)
[DEBUG-TRACE] → IN  | func=_flatten_input | file=api/routes.py:72 | args=(req.input='Hello')
[DEBUG-TRACE] ← OUT | func=_flatten_input | file=api/routes.py:72 | return='Hello'
[DEBUG-TRACE] → IN  | func=Decoder.generate | file=inference/decoder.py:36 | args=(prompt='Hello', config=GenerationConfig(max_new_tokens=64, temperature=0.0, top_k=None, top_p=None, repetition_penalty=None, do_sample=False, stop=[], eos_token_id=None, pad_token_id=None, seed=None, decoding_strategy='greedy', min_new_tokens=0), stream_callback=None)
[DEBUG-TRACE] → IN  | func=TokenizerAdapter.encode | file=model/tokenizer_adapter.py:18 | args=(text='Hello'...)
[DEBUG-TRACE] ← OUT | func=TokenizerAdapter.encode | file=model/tokenizer_adapter.py:18 | return=shape=torch.Size([1, 1])
[DEBUG-TRACE] → IN  | func=Sampler.__post_init__ | file=inference/sampler.py:13 | args=(do_sample=False, seed=None)
[DEBUG-TRACE] ← OUT | func=Sampler.__post_init__ | file=inference/sampler.py:13 | return=None
[DEBUG-TRACE] → IN  | func=build_default_processors | file=inference/logits_processors.py:115 | args=(temperature=0.0, top_k=None, top_p=None, repetition_penalty=None, min_new_tokens=0)
[DEBUG-TRACE] ← OUT | func=build_default_processors | file=inference/logits_processors.py:115 | return=processors_count=0, types=[]
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=False)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=271)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=271, buffer_len=0)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=1 | return='\n\n'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=40)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=40, buffer_len=1)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=2 | return='\n\nI'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=1079)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=1079, buffer_len=2)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=3 | return='\n\nI am'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=4460)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=4460, buffer_len=3)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=4 | return='\n\nI am trying'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=311)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=311, buffer_len=4)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=5 | return='\n\nI am trying to'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=1855)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=1855, buffer_len=5)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=6 | return='\n\nI am trying to create'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=264)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=264, buffer_len=6)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=7 | return='\n\nI am trying to create a'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=2025)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=2025, buffer_len=7)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=8 | return='\n\nI am trying to create a program'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=429)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=429, buffer_len=8)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=9 | return='\n\nI am trying to create a program that'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=646)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=646, buffer_len=9)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=10 | return='\n\nI am trying to create a program that can'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=1477)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=1477, buffer_len=10)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=11 | return='\n\nI am trying to create a program that can find'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=279)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=279, buffer_len=11)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=12 | return='\n\nI am trying to create a program that can find the'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=7192)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=7192, buffer_len=12)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=13 | return='\n\nI am trying to create a program that can find the maximum'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=897)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=897, buffer_len=13)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=14 | return='\n\nI am trying to create a program that can find the maximum value'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=304)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=304, buffer_len=14)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=15 | return='\n\nI am trying to create a program that can find the maximum value in'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=264)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=264, buffer_len=15)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=16 | return='\n\nI am trying to create a program that can find the maximum value in a'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=2661)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=2661, buffer_len=16)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=17 | return='\n\nI am trying to create a program that can find the maximum value in a given'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=1140)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=1140, buffer_len=17)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=18 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=315)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=315, buffer_len=18)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=19 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=5109)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=5109, buffer_len=19)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=20 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=13)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=13, buffer_len=20)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=21 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=2980)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=2980, buffer_len=21)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=22 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=498)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=498, buffer_len=22)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=23 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=3410)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=3410, buffer_len=23)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=24 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=264)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=264, buffer_len=24)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=25 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=6291)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=6291, buffer_len=25)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=26 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=369)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=369, buffer_len=26)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=27 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=419)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=419, buffer_len=27)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=28 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=3491)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=3491, buffer_len=28)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=29 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=30)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=30, buffer_len=29)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=30 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=22555)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=22555, buffer_len=30)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=31 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=11)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=11, buffer_len=31)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=32 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=358)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=358, buffer_len=32)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=33 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=646)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=646, buffer_len=33)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=34 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=1492)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=1492, buffer_len=34)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=35 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=498)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=498, buffer_len=35)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=36 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=448)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=448, buffer_len=36)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=37 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=429)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=429, buffer_len=37)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=38 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=0)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=0, buffer_len=38)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=39 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=5692)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=5692, buffer_len=39)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=40 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=594)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=594, buffer_len=40)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=41 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=264)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=264, buffer_len=41)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=42 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=13027)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=13027, buffer_len=42)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=43 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=2038)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=2038, buffer_len=43)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=44 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=43065)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=43065, buffer_len=44)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=45 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=429)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=429, buffer_len=45)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=46 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=1265)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=1265, buffer_len=46)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=47 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=653)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=653, buffer_len=47)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=48 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=1128)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=1128, buffer_len=48)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=49 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=498)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=498, buffer_len=49)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=50 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=1184)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=1184, buffer_len=50)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=51 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=1447)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=1447, buffer_len=51)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=52 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=73594)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=73594, buffer_len=52)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=53 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=12669)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=12669, buffer_len=53)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=54 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=198)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=198, buffer_len=54)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=55 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=750)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=750, buffer_len=55)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=56 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=1477)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=1477, buffer_len=56)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=57 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=6345)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=6345, buffer_len=57)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=58 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=47207)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=47207, buffer_len=58)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=59 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=982)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=982, buffer_len=59)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=60 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=262)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=262, buffer_len=60)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=61 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=3190)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=3190, buffer_len=61)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=62 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=262)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=262, buffer_len=62)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=63 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] → IN  | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | args=(input_ids.shape=torch.Size([1, 1]), use_cache=True, has_past_kv=True)
[DEBUG-TRACE] ← OUT | func=HFModelAdapter.next_token_logits | file=model/hf_model_adapter.py:64 | return=(logits.shape=torch.Size([1, 151936]))
[DEBUG-TRACE] → IN  | func=Sampler.sample | file=inference/sampler.py:19 | args=(logits.shape=torch.Size([1, 151936]), do_sample=False)
[DEBUG-TRACE] ← OUT | func=Sampler.sample | file=inference/sampler.py:19 | return=(greedy, token_id=1096)
[DEBUG-TRACE]   ↔   | func=TokenBuffer.append | file=inference/token_buffer.py:11 | args=(token_id=1096, buffer_len=63)
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=64 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE]   ↔   | func=TokenizerAdapter.decode | file=model/tokenizer_adapter.py:26 | tokens_count=64 | return='\n\nI am trying to create a program that can find the maximum value in a given lis'
[DEBUG-TRACE] ← OUT | func=Decoder.generate | file=inference/decoder.py:36 | return=(text='\n\nI am trying to create a program that can find the maximum value in a given list of numbers. Can yo', prompt_tokens=1, gen_tokens=64)
[DEBUG-TRACE] ← OUT | func=create_response | file=api/routes.py:115 | return=(id=resp_2a42be27144f, output_text='\n\nI am trying to create a program that can find the maximum value in a given list of numbers. Can yo')
INFO:     127.0.0.1:42440 - "POST /v1/responses HTTP/1.1" 200 OK
62.1k
128.0k

これの内容をAIに解析させ、結局どういう経路なの？を解説させてみたところ、
こういった結果となった。

1. サーバー起動フェーズ
python -m local_llm_runtime.api --port 8000 でサーバーを起動
デバッグトレーシングが有効化され、全関数の入出力が記録される設定
main() (api/server.py) → Uvicornを 0.0.0.0:8000 で起動
create_app() (api/app.py) → FastAPIアプリケーションを作成（OpenAI Responses API互換）
Uvicornが正常に起動完了

2. リクエスト受信フェーズ
POST /v1/responses にリクエストが到着
create_response() が呼ばれる（model='local', input='Hello', stream=False）

3. モデルロードフェーズ（遅延ロード）
RuntimeState.ensure_loaded() → モデルが未ロードなのでロード開始
Qwen/Qwen2.5-0.5B-Instruct（0.5Bパラメータの小型モデル）をHugging Face Hubからロード
TokenizerAdapter.__init__ → トークナイザーをダウンロード/初期化（vocab_size=151665）
HFModelAdapter.__init__ → モデル重み（290ファイル）をロード、デバイスはCPU
Decoder.__init__ → デコーダー（推論エンジン）を初期化
ロード完了まで約7秒（14:05:43〜14:05:50）

4. 生成設定フェーズ
_build_generation_config() → 生成パラメータを構築
max_new_tokens=64, temperature=0.0（greedy decoding）, サンプリングなし
_flatten_input() → 入力テキスト 'Hello' をそのまま使用

5. トークン生成フェーズ（自己回帰デコーディング）
Decoder.generate() が開始
入力 'Hello' を1トークンにエンコード（shape=[1,1]）
Greedy Decodingで1トークンずつ64トークン生成：
各ステップで:
HFModelAdapter.next_token_logits() → モデルのforward passでlogits取得（shape=[1, 151936]）
Sampler.sample() → greedy（argmax）でトークンID選択
TokenBuffer.append() → バッファに追加
TokenizerAdapter.decode() → 現在のバッファをテキストにデコード
KVキャッシュを使用（has_past_kv=True）→ 2トークン目以降は前のKVを再利用して効率化
生成されたテキスト: "\n\nI am trying to create a program that can find the maximum value in a given list of numbers. Can yo"（64トークンで打ち切り）

6. レスポンス返却
create_response() がレスポンスを構築（id=resp_2a42be27144f）
HTTP 200 OK で返却
アーキテクチャの特徴
OpenAI Responses API互換: /v1/responses エンドポイント
遅延ロード: 初回リクエスト時にモデルをロード
自前のデコーダー実装: HuggingFaceのgenerate()を使わず、トークン単位で制御
KVキャッシュ活用: 各ステップでinput_ids.shape=[1,1]（新トークンのみ入力）
詳細なデバッグトレース: 全関数の入出力を記録する仕組み

ちなみにログのほうは、後半はほとんど同じ文字列が並んでいることにお気づきだろうか？これはどうやら、5番の「トークン生成フェーズ」なのである。ここがモデルによる生成の本体となる。

そして、今回実は皆さんに伝えたかったのはここではない。私がずっと気になっていたのは「じゃあAIが考えているように見えるのはどこでやっているのか？」ということ。gpt3.5とか4の時代であれば、そんな複雑なことはしておらず、単純に確率モデルで次のトークンを予測しているものが、人間の目からすると思考しているように見えるだけ、というのは知っていた。実際、当時のモデルは数学の質問をしても適当な証明をしてくるし、動かないコードや間違った結論を平気で出してきていて、まあ確率モデルなんだろうな、といのも肌感で納得できた。ただ、最近のモデルは本当に性能が上がり、人間の仕事を奪うというのも、エンジニアの世界ではかなり現実になってきているものと思う。そんなAIに対して私が思っていた印象というのは、「何か推論をするためのルールベースのロジックなどを入れているのではないか？」というものである。すごく雑なイメージ図で言うと下記のようなものである。

ただAIに対してこの質問をしたところ、全く逆の答えが返ってきた

推論（ニューラルネットワークのforward pass）は 2つのファイルが連携して 実行されています：

1. inference/decoder.py — 推論ループ（オーケストレーター）
Decoder.generate() メソッド（42行目〜）が推論のメインループです。ここが全体を制御しています：


for step in range(config.max_new_tokens):  # 最大64回ループ
    # ① モデルにforward passさせてlogitsを取得
    logits, past_key_values = self.model_adapter.next_token_logits(
        input_ids=input_ids,
        past_key_values=past_key_values,
        use_cache=True,
    )
    # ② logits processorで加工（今回は0個なのでスキップ）
    for processor in processors:
        logits = processor(logits, context)
    # ③ サンプリングで次トークンを決定
    next_token = sampler.sample(logits)
    # ④ バッファに追加、次のステップへ
    buffer.append(next_token_id)
    input_ids = torch.tensor([[next_token_id]])  # 次はこの1トークンだけ入力
2. model/hf_model_adapter.py — 実際のニューラルネット計算
HFModelAdapter.next_token_logits() メソッド（70行目〜）が実際のGPU/CPU計算を行う場所です：


@torch.no_grad()
def next_token_logits(self, input_ids, ...):
    # ↓ ここがTransformerモデルの実際のforward pass（推論の本体）
    outputs = self.forward(input_ids, past_key_values=past_key_values, use_cache=True)
    # 最後のトークン位置のlogitsだけ取り出す
    logits = outputs.logits[:, -1, :]  # shape: [1, 151936]（語彙サイズ分の確率分布）
    return logits, outputs.past_key_values
内部の self.forward() は self.model(input_ids=...) を呼び、これが HuggingFaceのQwen2.5-0.5B-Instructモデル本体（Transformerの全レイヤー）を実行します。

3. inference/sampler.py — トークン選択
logitsから次のトークンを決定します。今回は temperature=0.0 なので greedy decoding（argmax）：


def sample(self, logits):
    if not self.do_sample:  # greedy
        return torch.argmax(logits, dim=-1, keepdim=True)  # 最も確率の高いトークンを選択
まとめ：推論の流れ

Decoder.generate() [decoder.py:70]  ← ループ制御
    ↓
HFModelAdapter.next_token_logits() [hf_model_adapter.py:70]
    ↓
HFModelAdapter.forward() [hf_model_adapter.py:52]
    ↓
self.model(input_ids=...) ← ★ここが本当の推論（Transformerのforward pass）
    ↓
logits (shape=[1, 151936]) が返る
    ↓
Sampler.sample() [sampler.py:25]  ← argmaxでトークンID決定
    ↓
次のトークンが確定 → ループの次のステップへ
つまり、推論の計算本体は hf_model_adapter.py の62行目 self.model(input_ids=...) で、PyTorchのTransformerモデルがforward passを実行する瞬間です。それを64回繰り返すループが decoder.py にあります。HuggingFaceの model.generate() は意図的に使わず、自前でトークン単位のデコーディングを実装しているのがこのプロジェクトの特徴です。

そう、実はgpt3.5の時代からやっていることはさほど変わっておらず、推論推論といわれているものも、結局は確率モデルによる生成なのである。現代で精度が上がっているのは、①単純にパラメーター数やインプット・アウトプットできるトークン数が増えた　②ツールによる外部検索を使用できるようになった　③強化学習によるフィードバックを強化した　の3点だけであった。つまり、結局やっていることは、先ほどのイメージ図で言うと、下記のようなものにすぎない。

短い答え：基本的な機構（次トークン予測）は同じ。でも「ループのさせ方」が進化している。
1. Chain-of-Thought（思考の連鎖）— 一番シンプルな方法
ChatGPTやClaudeが「考えてるように見える」のは、実はテキストとして思考過程を生成しているだけです：


ユーザー: 17 × 23 は？

モデルの生成（次トークン予測の繰り返し）:
"まず17×20=340、次に17×3=51、340+51=391。答えは391です。"
これは特別な機構ではなく、学習データに「ステップバイステップで考える例」が大量にあったので、モデルがそのパターンを学習しただけです。このコードの Decoder.generate() のループをそのまま回せば同じことが起きます。

2. Self-Consistency / Best-of-N — 複数回生成して評価

同じ質問を5回生成（temperature > 0でランダム性を入れる）
  → 回答A: "391"
  → 回答B: "391" 
  → 回答C: "401"  ← 外れ
  → 回答D: "391"
  → 回答E: "391"
多数決 → "391" が正解
これは同じ Decoder.generate() を複数回呼ぶだけです。機構は全く同じで、ループを複数回回して多数決を取っているだけ。

3. Reward Model（報酬モデル）による評価 — OpenAI o1/o3系
ここが一番「評価してる」に近い部分です：


① メインモデルが回答候補を複数生成（次トークン予測の繰り返し × N回）
② 別のモデル（Reward Model）が各候補にスコアをつける
   → これも実は「次トークン予測」と同じTransformerで、
     入力に対して「良さスコア」を出力するよう学習されたもの
③ 一番スコアが高い回答を採用
Reward Modelも中身はTransformerのforward passです。ただし出力が「次の単語の確率」ではなく「この回答の品質スコア（数値1つ）」になるよう学習されています。

4. OpenAI o1/o3 の「推論時計算（Test-time Compute）」
最近話題のo1/o3系は、もう少し複雑です：


① 問題を受け取る
② 内部で「思考トークン」を大量に生成（ユーザーには見えない）
   → "この問題は...まず...いや違う...こう考えると..."
   → これ自体は次トークン予測の繰り返し
③ 途中で「この方向は間違ってそう」と判断して別の方向へ
   → これもモデルが「wait, let me reconsider」的なトークンを生成するだけ
④ 最終的な回答を出力
ポイントは、「間違いに気づく」「方向転換する」も全てテキスト生成として実現されているということです。特別な「ロジックチェッカー」モジュールがあるわけではなく、「自分の出力を振り返って修正するパターン」を強化学習（RLHF/RLVR）で叩き込んでいるのです。

5. ツール使用（Code Interpreter等）
唯一「本当に外部でチェックしている」のがこれ：


① モデルが「Pythonコードを実行して確認しよう」というトークンを生成
② 実際にPythonインタプリタでコードを実行（← これはモデル外の処理）
③ 実行結果をモデルに戻す
④ モデルが結果を見て次のトークンを生成
これだけはモデルの外に実際のロジックチェック機構があります。

そうなると気になるのは、reasoning levelである。例えばChatGPTでも、最近は「即答する」「深く考える」などの選択ができる。これはいったい何なのか、これもAIに確認してみた

核心：「思考トークンの上限を変えているだけ」
あなたのコードに当てはめると、こういうことです：


# あなたのコード (decoder.py:70)
for step in range(config.max_new_tokens):  # ← ここの数字を変えるだけ
    logits = model.next_token_logits(input_ids)
    next_token = sampler.sample(logits)
Reasoning Level	思考トークン上限（推定）	イメージ
Low	数百〜1,000トークン	「ざっと考えて答える」
Medium	数千トークン	「しっかり考えて答える」
High	数万トークン	「徹底的に考え抜いて答える」
具体的に何が起きているか
Low（推論レベル低）:


[思考トークン: 上限500]
"この質問は...答えは42です。"
→ 思考が短いので速い・安い、でも間違いやすい

[公開される回答]
"答えは42です。"
High（推論レベル高）:


[思考トークン: 上限30,000]
"この質問を分析すると...
 まずアプローチ1を試す...結果はX...
 いや、これには問題がある。なぜなら...
 アプローチ2を試す...結果はY...
 検算: Yを元の条件に代入すると...合っている。
 念のためアプローチ3でも確認...やはりY。
 答えはYで確定。"
→ 思考が長いので遅い・高い、でも正確

[公開される回答]
"答えはYです。理由は..."

驚くほど単純であった。扱えるトークンが多くなると、生成結果のなかでグルグルとああでもないこうでもないと思考しているような生成物が出てきて、それが更に深い生成結果を出しているということなのである。

これを踏まえてAIの使い方をどうするか

内部構造はわかった、がじゃあそれを踏まえてAIを取扱う際にどうするか？はまた別の話であり、その点について私の見解を述べたい。
結局のところ、AIというのは確率モデルによるトークン予測であるということは、今も昔も変わっていないということがわかった。そうなると、逆に何が言えるかというと、「どれだけモデルが精密になったからといって、AIにインプットするプロンプトを雑にしてはいけない」ということである。

結局独立した検証の仕組みを持たないのがAIであり、限られたトークンの中でグルグル思考して生成するのが内部構造ということになると、その限られたトークンをいかに有効に使うかがミソになってくる。
つまりやはり、元々言われていたように、AIに対しては　①余計な情報やノイズは入れない　②論理的に整理し、指示を単純化して伝える　③AIが一度に理解できる情報量を食わせる　④指示を複数詰め込まず、1回に1指示とする　というようなものが必要で、これは今も昔も変わらないのだという結論に至る。

例えば私はこれから、とあるデータ基盤のジョブ定義を他の基盤に移行するためのエージェントを書こうとしているが、その際も、タスクを細かく細分化させたり、長いインプットデータを要約することが重要と感じている。

終わりに

LLMの進化は直近めざましく、技術的な進歩に知識が追い付かないことがあるが、やはりこれからの時代AIを中核に据えてビジネスを展開する場面も多くなってくるため、技術的な実験というのは今後も是非続けていきたいところである。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up