AlphaZero Training Metrics: Policy and Value Loss Over 13 Iterations

Reinforcement learning training run tracking policy/value losses, game length, MCTS simulations, and value calibration across 13 self-play iterations.

#	iteration	loss_policy_train	loss_value_train	loss_policy_val	loss_value_val	loss_soft_policy_train	loss_soft_policy_val	loss_aux_value_train	loss_aux_value_val	loss_aux_value_0_train	loss_aux_value_0_val	loss_aux_value_1_train	loss_aux_value_1_val	loss_aux_value_2_train	loss_aux_value_2_val	gradient_steps	game_length_avg	game_length_stddev	game_length_min	game_length_max	game_wins	game_losses	game_draws	policy_entropy_avg	policy_max_prob_avg	policy_entropy_high_branch_avg	policy_max_prob_high_branch_avg	policy_agreement_avg	policy_agreement_high_branch_avg	policy_surprise_avg	value_z_avg	value_q_avg	value_z_stddev	value_q_stddev	value_correction_avg	value_correction_high_branch_avg	value_q_spread_avg	value_q_spread_high_branch_avg	value_error_early_avg	value_error_mid_avg	value_error_late_avg	value_network_stddev	lr	q_weight	mcts_sims	replay_samples	samples_iter	time_selfplay_secs	time_train_secs
1	1	3.864787	0.320826	3.18191	0.242259	3.646052	3.076387	0.170796	0.031286	0.061016	0.005071	0.041854	0.009963	0.716512	0.133743	35	354.264	67.721033	134	418	206	257	37	1.069399	0.552025	1.627311	0.355778	0.306516	0.152918	0.900903	0.117202	-0.043852	0.755335	0.117126	0.256406	0.220012	0.110637	0.128793	0.703323	0.707441	0.717725	0.043437	0.0005	0.028333	100	178444	178444	286.273461	77.393945
2	2	3.054179	0.214878	2.736785	0.18571	2.956313	2.728005	0.025733	0.01787	0.036853	0.028107	0.038285	0.029353	0.048863	0.028961	66	363.372	59.736836	132	420	235	235	30	1.107233	0.555301	1.653229	0.386097	0.212524	0.147646	1.226714	0.125226	0.066841	0.75366	0.302264	0.161567	0.139284	0.075105	0.073412	0.623991	0.541579	0.498636	0.311493	0.0005	0.056667	197	334478	156034	311.520347	122.90891
3	3	2.473027	0.319545	2.383122	0.304262	2.611523	2.547672	0.030155	0.017883	0.043951	0.021795	0.04879	0.029612	0.051183	0.033476	42	306.762	80.571219	125	456	246	242	12	0.914362	0.649239	1.223786	0.563321	0.343803	0.29213	1.169231	0.107064	0.078501	0.913341	0.397952	0.118311	0.109984	0.055504	0.064856	0.795905	0.72426	0.626296	0.413562	0.0005	0.085	293	214466	214466	565.548518	90.27873
4	4	2.306242	0.283898	2.178379	0.285746	2.373558	2.262347	0.023122	0.027903	0.033743	0.03843	0.035401	0.045739	0.040325	0.047388	92	290.198	101.386364	99	488	223	266	11	0.636061	0.749145	0.864953	0.690746	0.414222	0.362223	1.309917	0.052395	0.030916	0.88627	0.428252	0.157282	0.157085	0.05787	0.074392	0.807009	0.671657	0.560194	0.441084	0.0005	0.113333	390	468004	253538	776.247988	171.735797
5	5	2.120492	0.266878	2.084083	0.281335	2.201041	2.173777	0.022068	0.019783	0.029671	0.025418	0.034216	0.0314	0.040508	0.03751	125	245.736	82.902221	91	448	297	197	6	0.727392	0.718925	1.036619	0.622322	0.347107	0.255754	1.275226	0.092726	0.062839	0.952357	0.48553	0.178129	0.207145	0.073739	0.089232	0.855319	0.713809	0.555246	0.479771	0.0005	0.141667	487	635428	167424	624.092566	233.373341
6	6	2.056717	0.254081	2.004317	0.244936	2.128085	2.089925	0.025475	0.023007	0.035926	0.038931	0.040308	0.031487	0.044802	0.038463	159	224.934	85.427499	84	460	227	271	2	0.507362	0.801921	0.702605	0.743406	0.388656	0.307349	1.443071	0.064813	0.037782	0.960654	0.475798	0.198953	0.211323	0.075591	0.088419	0.876541	0.746356	0.578257	0.496498	0.0005	0.17	583	810439	175011	721.00008	297.277773
7	7	1.980565	0.240194	1.962871	0.247953	2.056309	2.038054	0.024335	0.023682	0.035475	0.028279	0.036753	0.039402	0.043138	0.045329	184	199.75	63.037604	71	438	253	243	4	0.709711	0.730818	1.050833	0.624401	0.331796	0.203324	1.272419	0.052391	0.052032	0.982562	0.453649	0.125269	0.11675	0.056044	0.065202	0.901564	0.776227	0.608777	0.460589	0.0005	0.198333	680	939431	128992	609.804331	345.188357
8	8	1.934722	0.224824	1.926557	0.237877	2.009667	1.996964	0.023222	0.030667	0.031103	0.045675	0.036661	0.050567	0.042819	0.049246	209	205.662	65.041493	92	434	297	201	2	0.669525	0.741777	1.023188	0.622877	0.319223	0.180888	1.501361	0.056121	0.060203	0.982131	0.472652	0.203874	0.193068	0.091324	0.104065	0.890773	0.743353	0.584211	0.457012	0.0005	0.226667	777	1065506	126075	655.579999	390.819171
9	9	1.905073	0.210875	1.928705	0.205652	1.952473	1.973959	0.022604	0.02032	0.028725	0.02477	0.03595	0.031809	0.042882	0.040527	198	191.228	71.703305	79	462	234	265	1	0.450463	0.823819	0.639126	0.766002	0.38914	0.288425	1.395133	0.044485	0.032008	0.976307	0.470015	0.16768	0.161381	0.074023	0.086891	0.903919	0.743153	0.570822	0.480495	0.0005	0.255	873	1011396	160356	953.552464	371.128067
10	10	1.885457	0.202162	1.930772	0.194228	1.936023	1.970428	0.023354	0.021064	0.028933	0.025361	0.037461	0.033421	0.044752	0.041666	174	183.046	61.119096	70	460	255	240	5	0.657052	0.752794	1.023239	0.635632	0.362461	0.230714	1.293476	0.062379	0.061959	0.981138	0.480829	0.122319	0.116329	0.05668	0.06547	0.893507	0.749269	0.558663	0.493849	0.0005	0.283333	970	888992	131134	815.300974	326.27391
11	11	1.845871	0.191741	1.825237	0.184761	1.896899	1.87675	0.02634	0.023318	0.03383	0.02774	0.042235	0.03806	0.049186	0.044817	172	190.236	63.771156	70	456	274	226	0	0.661495	0.752031	1.068277	0.625464	0.433967	0.306839	1.176193	0.103805	0.096743	0.980855	0.551427	0.096549	0.095214	0.046351	0.053343	0.853389	0.680149	0.469069	0.555274	0.0005	0.311667	1067	878536	156968	1076.774026	322.572067
12	12	1.769169	0.18186	1.772084	0.176119	1.83411	1.844026	0.028092	0.024488	0.036672	0.028815	0.045173	0.039682	0.051839	0.047161	166	152.794	46.067033	73	468	249	250	1	0.535325	0.799832	0.891145	0.685752	0.475168	0.291591	1.142953	0.106931	0.078072	0.989762	0.506	0.09566	0.093859	0.051289	0.056117	0.880376	0.72019	0.516044	0.519486	0.0005	0.34	1163	848198	144673	1061.730934	311.39659
13	13	1.743754	0.158582	1.726435	0.153854	1.81128	1.799296	0.025183	0.021926	0.030586	0.025736	0.040735	0.035313	0.04855	0.042916	168	153.432	51.590012	68	478	231	268	1	0.579754	0.784928	0.941172	0.672857	0.472446	0.284693	1.125724	0.159865	0.117188	0.97555	0.534392	0.094677	0.101375	0.043354	0.04914	0.861803	0.653665	0.446512	0.536952	0.0005	0.368333	1260	857260	138054	1159.378612	314.991295
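The table above is tab-separated, with a leading `#` column that duplicates `iteration`. As a minimal sketch of how it could be loaded for analysis, the snippet below parses a small excerpt (three rows, four metric columns, values copied from the table) with the standard library. The helper name `load_metrics` is hypothetical and not part of the dataset.

```python
import csv
import io

# Excerpt of the table above, rebuilt as tab-separated text.
# Only a few columns are reproduced; the full table has 50.
HEADER = ["#", "iteration", "loss_policy_train", "loss_policy_val", "mcts_sims"]
ROWS = [
    ["1", "1", "3.864787", "3.18191", "100"],
    ["2", "2", "3.054179", "2.736785", "197"],
    ["13", "13", "1.743754", "1.726435", "1260"],
]
tsv_text = "\n".join("\t".join(row) for row in [HEADER, *ROWS])


def load_metrics(text):
    """Parse tab-separated metrics into a list of {column: float} dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    records = []
    for row in reader:
        row.pop("#", None)  # drop the row-number column; it mirrors `iteration`
        records.append({name: float(value) for name, value in row.items()})
    return records


metrics = load_metrics(tsv_text)

# Train-minus-validation policy loss for each parsed iteration.
policy_gap = [m["loss_policy_train"] - m["loss_policy_val"] for m in metrics]
for m, gap in zip(metrics, policy_gap):
    print(f"iteration {int(m['iteration']):2d}: train - val = {gap:+.3f}")
```

The same function should work on the full table if it is saved as tab-separated text, assuming every cell is numeric as it is here.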


AI Analysis

Training Summary

Policy loss fell 55% (3.86 → 1.74), and validation loss ends within 0.02 of training loss (1.73 vs 1.74), so there is no sign of overfitting
Average game length dropped from 354 to 153 moves, and draws fell from 37 to 1 of the 500 games played per iteration, consistent with the agent learning to finish games decisively
Policy agreement with MCTS rose from 31% to 47%, meaning the network agrees with tree search more often but the two still differ in slightly more than half of positions
Late-game value error improved 38% (0.72 → 0.45), while early-game error rose from 0.70 to 0.86, so the model evaluates endgames well but struggles with openings
MCTS simulations scaled from 100 to 1,260 per move across the run, giving the search tree progressively more budget
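The percentages in the summary can be rechecked from the first and last rows of the table. The sketch below does that with a plain relative-change formula; the dictionary name `FIRST_LAST` and the helper `relative_change` are illustrative, and the values are copied from iterations 1 and 13.

```python
# (iteration 1 value, iteration 13 value) copied from the metrics table.
FIRST_LAST = {
    "loss_policy_train": (3.864787, 1.743754),
    "game_length_avg": (354.264, 153.432),
    "policy_agreement_avg": (0.306516, 0.472446),
    "value_error_early_avg": (0.703323, 0.861803),
    "value_error_late_avg": (0.717725, 0.446512),
    "mcts_sims": (100, 1260),
}


def relative_change(first, last):
    """Signed fractional change from first to last: (last - first) / first."""
    return (last - first) / first


changes = {
    name: relative_change(first, last)
    for name, (first, last) in FIRST_LAST.items()
}

for name, frac in changes.items():
    print(f"{name:24s} {frac:+.1%}")
```

Run as written, this gives roughly -55% for policy loss, -38% for late-game value error, and +23% for early-game value error.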

Visualizations

Charts (not reproduced here): Value Error by Game Phase; Average Game Length; Value Loss (Train vs Validation); Policy Loss (Train vs Validation)

Opening Weakness

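Early-game value error rose over training (0.70 → 0.86, peaking at 0.90 in iteration 9), while late-game error fell (0.72 → 0.45) and mid-game error ended only slightly lower (0.71 → 0.65). This suggests the model is learning tactical, late-game evaluation faster than positional understanding in the opening phase. Part of the rise may also come from the targets: as draws disappeared, the spread of game outcomes widened (value_z_stddev 0.76 → 0.98), which makes results harder to predict from early positions. This may be a pattern that resolves with more iterations, but the 13 iterations recorded here do not yet show it.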
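To make the divergence between game phases concrete, the sketch below computes the early-minus-late error gap for every iteration, along with the iteration at which early-game error was highest. The lists are copied from the `value_error_early_avg` and `value_error_late_avg` columns; the variable names are illustrative.

```python
# Per-iteration value errors (iterations 1..13) copied from the metrics table.
EARLY = [0.703323, 0.623991, 0.795905, 0.807009, 0.855319, 0.876541, 0.901564,
         0.890773, 0.903919, 0.893507, 0.853389, 0.880376, 0.861803]
LATE = [0.717725, 0.498636, 0.626296, 0.560194, 0.555246, 0.578257, 0.608777,
        0.584211, 0.570822, 0.558663, 0.469069, 0.516044, 0.446512]

# Gap between early- and late-game error; positive means openings are worse.
phase_gap = [early - late for early, late in zip(EARLY, LATE)]

# 1-based iteration at which early-game error was highest.
peak_early_iter = max(range(len(EARLY)), key=EARLY.__getitem__) + 1

for iteration, gap in enumerate(phase_gap, start=1):
    print(f"iteration {iteration:2d}: early - late = {gap:+.3f}")
print("early-game error peaks at iteration", peak_early_iter)
```

The gap starts slightly negative in iteration 1 and ends at about +0.42 in iteration 13, so the two phases are about equally hard for the value head at the start and clearly separated by the end.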

