crashes/desyncs in multiplayer

Bug #1721126 reported by kaputtnik
This bug affects 2 people
Affects: widelands
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

During the last multiplayer game for the tournament with version bzr8461, Hasi50 and I had several desyncs and crashes (the whole game crashed, and the Widelands window was gone).

The terminal output of one crash is attached.

Shortly before one crash, all new notifications on my system got messed up somehow: newly arriving messages were added at the top, but the message window scrolled to the bottom.

This needs to be stabilized before build20, imho.

kaputtnik (franku) wrote :
Changed in widelands:
milestone: none → build20-rc1
tags: added: multiplayer
kaputtnik (franku)
Changed in widelands:
importance: Undecided → Critical
SirVer (sirver) wrote :

Can you find and provide the .wss files before the desyncs?

kaputtnik (franku) wrote :

Some are quite big; I will send them separately.

kaputtnik (franku) wrote :

113 MB seems to be too big for Launchpad... here is a smaller file. I no longer remember whether this was a crash or a desync.

kaputtnik (franku) wrote :

And here is the last .wss file from the game. I think this was just a desync.

kaputtnik (franku) wrote :

Zipped first desync

kaputtnik (franku) wrote :
kaputtnik (franku) wrote :
SirVer (sirver) wrote :

Hasi also needs to provide the WSS files, and we require the savegame and ideally the replay files as well.

The differences in the WSS files can lead us to the desyncs - in a correct game the files should be completely identical.

Changed in widelands:
assignee: nobody → Klaus Halfmann (klaus-halfmann)
Klaus Halfmann (klaus-halfmann) wrote :

Attached is the Apple crash report.

It looks like a memory corruption that made the Lua GC crash:

6 widelands 0x0000000102c43270 LuaGameInterface::write_global_env(FileWrite&, Widelands::MapObjectSaver&) + 48 (logic.cc:200)
7 widelands 0x0000000102a99eda Widelands::MapScriptingPacket::write(FileSystem&, Widelands::EditorGameBase&, Widelands::MapObjectSaver&) + 2106 (map_scripting_packet.cc:92)
8 widelands 0x0000000102a984bb Widelands::MapSaver::save()

Klaus Halfmann (klaus-halfmann) wrote :

Here is the replay, including the .wss files.

Klaus Halfmann (klaus-halfmann) wrote :

I may try to reproduce it using a local Windows/Mac setup.
I have almost no time these days and other things have
higher priority, sorry.

kaputtnik (franku) wrote :

During the game (Territorial Time) we suspected that the status message caused the crashes.

It seems to me that the later in the game, the more often desyncs and crashes happened.

Maybe this is the same issue as bug 1651591 "Atlanteans mission 1 needs a lot of memory", where SirVer suspects that Lua's garbage collector is the culprit?

Klaus Halfmann (klaus-halfmann) wrote :

I invested quite some time this morning into reproducing this. Here is my log of what I did and found:

OSX 10.13 (High Sierra)
Widelands VERSION bzr8462[trunk]
Locale: en

Windows 10 Home Version 1703 Build 15963.632
Widelands VERSION master-2511 (from AppVeyor)
Locale: en

Save file MD5 (crashed3.wgf) = 14b301432a3ce2084fe3c3d152d6e9c8
local network

played till the end -> no issues.

MD5 (crashed2.wgf) = b26f1f9a1f6962e8e1df57a2d6eccb68
local network

played till the end -> no issues.

MD5 (crashed.wgf) = 5e46b773c6cef7caa0242a42518b3c72

[Host]: comparing syncreports for time 13804274
[Host]: lost synchronization with client 0!
I have: cdb97e3a653646bc54a36a4cc609c559
Client has: f3936db700e84c264edf9c26191a07bd
[Host]: disconnect_player_controller(1, HasiWin)
[Host]: disconnect_client(0, CLIENT_DESYNCED, )
[NetHost] Closing network connection to 192.168.254.23:51904.
[Host]: disconnect_client(0, CONNECTION_LOST, )
ComputerPlayer(2): initializing as type 2
    ... DNA initialized
  2: 0 basic buildings in savegame file.
 2: expedition max duration = 10350 (172 minutes), map area root: 256
lastserial: 0

This desync happened with the 10-minute report. I played mostly on
the Mac side, leaving the Windows side alone (except for keeping the
screensaver from kicking in). I was called to the phone, and when I
came back the Mac side was very slow. I assume some code (lua GC)
was so slow that the games got out of sync.

As both versions were debug builds, I assume they both have a (huge) sync stream.
SirVer: what files exactly would that be?

In replays I got:
-rw-r--r-- 1 769344 8 Okt 09:27 2017-10-08T09.27.05_nethost.wrpl.wgf
-rw-r--r-- 1 139659395 8 Okt 10:36 2017-10-08T09.27.05_nethost.wrpl.wss
-rw-r--r-- 1 465307 8 Okt 10:36 2017-10-08T09.27.05_nethost.wrpl

I guess we would need the same .wss from the Windows side?

I will stop for today. Next time I will let them run at, say, 50x speed from that
starting point.

And I think the Nile is the biggest map size we should allow for now :-)

SirVer (sirver) wrote :

> I assume some code (lua GC) was so slow that the games got out of sync.

This cannot happen. Widelands logic is always in sync, always running the exact same code. Lua GC is also deterministic (and should also not affect logic). If a computer is too slow to keep up with the simulation, the game will stutter for all other players.

A desync always means a programming error: two computers ran two different code paths, which indicates non-determinism somewhere in the code. One example is a std::set<Worker*> or another ordered data structure containing things (for example pointers) that will not be the same on both computers, which changes the order in which the logic runs.
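
To make the pointer example concrete, here is a minimal C++ sketch. It is not Widelands code: the `Worker` struct, its `serial` field, the `BySerial` comparator, and `visit_order` are hypothetical names used only for illustration. A `std::set<Worker*>` with the default comparator iterates in address order, and addresses differ between machines and runs. Ordering by an identifier that is assumed to be assigned identically on every machine gives the same iteration order everywhere.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical stand-in for a game object; not the real Widelands class.
struct Worker {
    std::uint32_t serial;  // assumed to be assigned identically on every machine
};

// Order by the stable serial instead of by the pointer value.
struct BySerial {
    bool operator()(const Worker* a, const Worker* b) const {
        return a->serial < b->serial;
    }
};

// Returns the serials in the order in which the set is iterated.
std::vector<std::uint32_t> visit_order(const std::vector<Worker*>& workers) {
    std::set<Worker*, BySerial> ordered(workers.begin(), workers.end());
    std::vector<std::uint32_t> result;
    for (const Worker* w : ordered) {
        result.push_back(w->serial);
    }
    return result;
}
```

With this comparator the visiting order depends only on the serials, not on where the objects happen to live in memory, so two machines holding the same objects would walk them in the same order.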

The WSS files are exactly the same for both computers if there is no desync. They start at the state in which the savegame was loaded, so the .wgf plus the .wss from both computers is the minimum required to debug a desync.

If there is a desync, they will diverge at some timestamp. Analyzing the surroundings might indicate what the problem is. There are simple tools in utils/syncstream/* to decode these files into hex.
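
As a rough illustration of finding the point where two sync streams diverge, here is a minimal C++ sketch. It is not one of the tools in utils/syncstream/ and assumes nothing about the .wss format: it treats both streams as raw bytes that have already been read into memory and reports the first offset at which they differ. The names `first_divergence` and `kNoDivergence` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sentinel returned when both buffers are byte-for-byte identical.
constexpr std::size_t kNoDivergence = static_cast<std::size_t>(-1);

// Returns the offset of the first differing byte. If one buffer is a strict
// prefix of the other, the length of the shorter buffer is returned.
std::size_t first_divergence(const std::vector<std::uint8_t>& host,
                             const std::vector<std::uint8_t>& client) {
    const std::size_t common = std::min(host.size(), client.size());
    const auto host_end = host.begin() + static_cast<std::ptrdiff_t>(common);
    // Compare only the range both buffers have in common.
    const auto diff = std::mismatch(host.begin(), host_end, client.begin());
    const std::size_t offset = static_cast<std::size_t>(diff.first - host.begin());
    if (offset < common || host.size() != client.size()) {
        return offset;
    }
    return kNoDivergence;
}
```

The returned offset only says where the streams stop agreeing; working out which game logic wrote those bytes still requires decoding the surrounding data.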

Klaus, thanks for putting so much work into this issue. I am very interested in debugging this, but I do not know when I'll get around to it.

SirVer (sirver) wrote :

I am surprised that the wss files are so big. Tibor, is the new AI sending many more commands than the old one?

Changed in widelands:
assignee: Klaus Halfmann (klaus-halfmann) → TiborB (tiborb95)
TiborB (tiborb95) wrote :

No, I don't think it sends significantly more commands. I would say no more than 1.5 times as many.

If you can point me to the most frequent command, I can review the code, no problem...

TiborB (tiborb95) wrote :
SirVer (sirver) wrote :

#18: I'd say unlikely, but who knows.

tags: added: crash
GunChleoc (gunchleoc) wrote :

I remember that the font renderer will occasionally have a newline node that's not consumed with the status messages - I don't know why yet. That assert failure was introduced when I implemented dynamic width tags. I still don't really understand the fit_line/fit_node code there, so I have been unable to fix that so far. Since this was triggered by a status message, it might be related.

GunChleoc (gunchleoc)
tags: added: desync
GunChleoc (gunchleoc) wrote :

Territorial Lord is broken: https://bugs.launchpad.net/widelands/+bug/1759857

Maybe Territorial Time has the same problem?

kaputtnik (franku) wrote :

Just played a multiplayer game with WorldSavior on the map Two Frontiers and we had no desyncs. Setup: WorldSavior and I as red (Headquarter) against 2 AI (Fortified Village).

Notabilis (notabilis27) wrote :

Since multiple partial problems (seem to) have been solved, I guess this bug can be considered fixed?

kaputtnik (franku) wrote :

I think we should at least test a big multiplayer game (probably 4 to 6 players, some AI and seafaring)... just to be sure.

kaputtnik (franku) wrote :

Today I played a LAN game on the map 'Crossing the horizon' against myself and 2 AI. Both AI players managed to build a port, started expeditions and built ports on other islands. This game ran for more than 3 hours and I had no desync. Reloading the game from a saved state also ran without problems.

Notabilis, I don't know whether it is important to start an internet game (instead of a LAN game) to recheck that this bug is fixed. Feel free to mark this one as fixed if you don't think starting an internet game is important :-)

kaputtnik (franku) wrote :

Played the map "crossing the horizon" with teppo using the better_desyncs branch, and we got a desync after a few minutes.

My files are attached here; teppo's files can be found here: https://bugs.launchpad.net/widelands/+bug/1800364/comments/9

kaputtnik (franku) wrote :

Don't mark this bug as solved when merging the branch

https://code.launchpad.net/~widelands-dev/widelands/bug-1811583-desync-with-territorial

Teppo and I had a desync in Autocrat, but the proposed branch only solves desyncs with territorial win conditions.

Notabilis (notabilis27) wrote :

Just a short status report: Based on the syncstreams and some further testing, it seems as if there is some desync bug related to production sites. The entity serials in the syncstream referred to production sites, and based on the timings it seems as if they were running their production programs.
My current theory is that for one player the building started to produce a ware while it hadn't decided to do so on the other player's computer. Unfortunately I haven't been able to prove or disprove this yet. I haven't spotted any obvious problems in (some of) the relevant code parts, though there is a lot of code since it also indirectly includes lots of economy code. Adding more debug output hasn't helped yet either, since the bug hasn't occurred since I added more output.
So in the end: Some theories, but no results. :-/

GunChleoc (gunchleoc) wrote :

The desyncs seem to be gone, so I'm closing this bug.

Changed in widelands:
status: New → Fix Committed
assignee: TiborB (tiborb95) → nobody
GunChleoc (gunchleoc) wrote :

Fixed in build20-rc1

Changed in widelands:
status: Fix Committed → Fix Released