Multiplayer desync

Bug #1811030 reported by _aD
This bug affects 2 people
Affects   | Status       | Importance | Assigned to | Milestone
widelands | Fix Released | Undecided  | Unassigned  |

Bug Description

Windows 7 64-bit using bzr8955 and Ubuntu 18.04 LTS using bzr8956. (The difference between these revisions is a Barbarian campaign change; the desync was also present with Windows bzr8942 and Ubuntu bzr8950.) Map: Encounters.

We've had several desyncs in this session and I think they may be related to warfare. The map is attached, as is a savegame that reliably desyncs after a second or two.

Tags: desync

kaputtnik (franku) wrote :

From what I know about desyncs, all .wss files are needed to investigate this: yours and the ones from the other player. Can you provide them?

Not sure if this should be marked as a duplicate of https://bugs.launchpad.net/widelands/+bug/1721126

_aD (ad-simplypeachy) wrote :

Thanks for the tip. I read through the outstanding desync reports and the descriptions seemed different enough to warrant a new report, particularly since we were playing on Autocrat and the desync may be triggered by military actions. I'll get hold of the other player's .wss files and upload them.

kaputtnik (franku) wrote :

Did one player have problems with the internet connection?

From the server log, it looks like one client lost its connection or similar, then the desync happened and then the game crashed:

[Host]: Client 0 (damia) hung
[Host]: 1 clients hung. Entering wait mode
ComputerPlayer(2): initializing as type 2
    ... DNA initialized
  2: 0 basic buildings in savegame file.
 2: expedition max duration = 8346 (139 minutes), map area root: 199
  2: Starting preparation for expedition in port at 82x 10
[Host]: Client 0: Time 27158685
[Host]: Client 0 reports time 27158685 (networktime = 27163569) during hang
[Host]: Client 0 (damia) hung
[Host]: comparing syncreports for time 27158685
[Host]: lost synchronization with client 0!
I have: beb0cdcfc333811ca8b53e6cf0514abd
Client has: da82b1a5156042cca5bcd480bda38a8d
FATAL ERROR - game crashed. Attempting emergency save.

Later on it looks similar:

InternetGaming: Client announced the disconnect from the game peachygame.
...
InternetGaming: Client announced the disconnect from the game downypeach.
[Client]: disconnect(CLIENT_LEFT_GAME, )
[NetRelayConnection] Closing network socket connected to ...
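The first log shows the host "comparing syncreports" for one game time and then printing "I have:" and "Client has:" values that differ. That suggests the general detection scheme: each computer summarizes its game state at the same game time and the host compares the summaries. Below is a minimal sketch of that idea, assuming a hypothetical flat list of integer state values and using FNV-1a as a stand-in digest; Widelands' actual sync-report code is not shown in this thread, and the 32-hex-digit values in the log only suggest an MD5-sized digest.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a sync report: every value that must agree between
// host and client is folded into one 64-bit FNV-1a hash. Not Widelands' code.
std::uint64_t sync_hash(const std::vector<std::int32_t>& state) {
    std::uint64_t h = 14695981039346656037ull;  // FNV-1a 64-bit offset basis
    for (std::int32_t v : state) {
        const auto u = static_cast<std::uint32_t>(v);
        for (int byte = 0; byte < 4; ++byte) {
            h ^= (u >> (8 * byte)) & 0xffu;  // mix in one byte at a time
            h *= 1099511628211ull;           // FNV-1a 64-bit prime
        }
    }
    return h;
}

// Host side: the client is in sync only if its report for the same game time
// matches the host's own report.
bool in_sync(const std::vector<std::int32_t>& host_state,
             const std::vector<std::int32_t>& client_state) {
    assert(host_state.size() == client_state.size());  // both report the same values
    return sync_hash(host_state) == sync_hash(client_state);
}
```

A single differing value, such as one soldier's hit points after a fight resolved differently, is enough to make the two reports differ.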

Alasia (alasia) wrote :

Hi there, I'm the other player (damia), and my files are huge even after compression - apologies for that!

We had significant trouble with this one and we were able to reproduce the desync reliably whether I was the server or _aD was. We send our files outside of the game because it's significantly faster.

This is the only game so far that we have tried to resurrect in multiple ways over different sessions; each time the attack in question occurs, the desync follows very closely behind.

We do appreciate that the difference between AU and the UK may cause issues, so we've been careful to check for that by restarting from the save.

Thanks for looking into it :o)

Notabilis (notabilis27) wrote :

I took a look at the *.wss files created when continuing the savegame. It desyncs for me, and the bug seems to be caused by the global pseudo-random number generator in combination with the AI. Am I right that the savegame contains one human player and one AI?

Short theory lesson: The random number generator (RNG) uses its current internal state to generate a new number, changing that state at the same time. When the initial state is the same (e.g., directly after the game is loaded), the same sequence of numbers is generated on both the host and the client computer.
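The property described here, that identical internal states produce identical sequences, holds for any deterministic generator. This sketch uses std::mt19937 purely as a stand-in, since the game's own RNG class is not shown in this thread; the helper names `draw` and `same_sequence` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Draws n numbers. Each call to rng() computes a value from the current
// internal state and advances that state.
std::vector<std::uint32_t> draw(std::mt19937& rng, std::size_t n) {
    std::vector<std::uint32_t> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(static_cast<std::uint32_t>(rng()));
    }
    return out;
}

// Two generators that start in the same state (e.g. host and client right after
// loading a savegame) produce the same n numbers and end in the same state.
bool same_sequence(std::uint32_t seed, std::size_t n) {
    std::mt19937 host(seed);
    std::mt19937 client(seed);
    const bool equal = draw(host, n) == draw(client, n);
    assert(host == client);  // states still identical after the same number of draws
    return equal;
}
```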

In the savegame, the AI uses the global RNG to generate a random number for its ship. However, this is only done on the host computer (that is fine), so the client does not call the RNG at that point. As a result, different random numbers are generated afterwards for other "random" actions (here we have a problem), e.g., the movement of deer (the deer might walk left on the host but right on the client). Surprisingly, this works for quite a long time until finally something changes in the game and the computers desync. In your case it could be the fight between two soldiers that turns out differently on the two computers, leading to the desync (e.g., 2 damage received on the host but 5 damage on the client).
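A hypothetical model of this failure: `simulate` stands for one computer's run from the loaded savegame, `run_ai` is true only on the host, and the AI's ship decision consumes one number from the same generator that later resolves soldier fights. std::mt19937 stands in for the game RNG, and the damage formula is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// One computer's run from the loaded savegame (hypothetical model).
std::vector<int> simulate(std::uint32_t seed, bool run_ai, int fights) {
    assert(fights >= 0);
    std::mt19937 game_rng(seed);  // identical state on host and client after loading
    if (run_ai) {
        // Host only: the AI's ship decision draws from the SAME shared generator.
        // discard(1) advances the state exactly like one call would.
        game_rng.discard(1);
    }
    std::vector<int> damage;
    for (int i = 0; i < fights; ++i) {
        // Soldier fight: made-up damage in 1..6 taken from the shared stream.
        damage.push_back(1 + static_cast<int>(game_rng() % 6));
    }
    return damage;
}
```

Each later draw on the host is the client's next draw, so every subsequent "random" outcome is shifted by one position even though both computers started from the same state.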

I haven't tested it yet with a new game, but if I am right, the presence of an AI player will lead to desyncs sooner or later. Continuing your savegame with two human players works fine without an immediate desync.
An untested, probable fix would be to give the AIs their own RNG: they are only calculated on the host, so the AI's random numbers do not need to be synchronized with the client(s).
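A sketch of the proposed fix, assuming a hypothetical `ComputerPlayer` type and a made-up fight model: the AI keeps a private generator that is not part of the synchronized game state, so its decisions on the host no longer advance the shared RNG (std::mt19937 again stands in for the game's RNG). The private engine is seeded from std::random_device to emphasize that it need not match anything on the clients.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical AI with a private, host-local generator.
struct ComputerPlayer {
    std::mt19937 ai_rng{std::random_device{}()};  // not synchronized, any seed is fine

    // Hypothetical ship decision: one of four directions.
    std::uint32_t pick_ship_direction() {
        return static_cast<std::uint32_t>(ai_rng() % 4);
    }
};

// One computer's run; only the host runs the AI.
std::vector<int> simulate_fixed(std::uint32_t seed, bool run_ai, int fights) {
    assert(fights >= 0);
    std::mt19937 game_rng(seed);  // shared logic RNG: identical on host and client
    if (run_ai) {
        ComputerPlayer ai;
        const std::uint32_t direction = ai.pick_ship_direction();  // game_rng untouched
        assert(direction < 4);
    }
    std::vector<int> damage;
    for (int i = 0; i < fights; ++i) {
        // Made-up fight damage in 1..6 drawn from the shared RNG.
        damage.push_back(1 + static_cast<int>(game_rng() % 6));
    }
    return damage;
}
```

A private engine per AI is only one way to get a computer-local source of randomness; std::rand() is also computer-local and would keep the shared stream aligned in the same way.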

I haven't followed the changes made to the AI since build 19: could it be that the global RNG is a new addition to the AI code and that a local RNG was used in previous versions?

_aD (ad-simplypeachy) wrote :

Fascinating! We have had games of many hours, all with AI present, including this one. Some games had no desyncs, some had one or two, but this game seemed to have them very often. The vicissitudes of the RNG are strong :-)

None of the games we played with build 19 had desync problems, although we have played that version less often.

kaputtnik (franku) wrote :

Thanks for your analysis Notabilis!

Since build 19, the AI code has been replaced completely by one using a genetic algorithm:

https://wl.widelands.org/forum/topic/2646/

There are some network related notes in https://bazaar.launchpad.net/~widelands-dev/widelands/trunk/view/head:/src/ai/defaultai.h#L58

Grepping for 'random' in src/ai also yields some more results.

Notabilis (notabilis27) wrote :

Thanks for the pointers! The note regarding networking is a good find since it confirms that the AI is only run on the host, so logic_rand() shouldn't be used.

From the grep results, most of the AI uses the "normal", computer-local std::rand() function. The exception is the seafaring code, where the global logic_rand() method is used, leading to desyncs. So games with AIs are fine as long as the AI is not using ships. _aD, can you confirm that the desyncs happen only (or mostly?) on maps with seafaring?

I will prepare a branch that replaces logic_rand() with std::rand() in the AI, which hopefully fixes this issue.

kaputtnik (franku) wrote :

Your finding correlates with comment 15 in this bug report: https://bugs.launchpad.net/widelands/+bug/1797549

Although that comment only mentions that the ship's name differs between host and client...

Toni Förster (stonerl) wrote :

GunChleoc already tackled the problems with ships reported in bug 1800338 (as far as I remember, no AI was involved, but don't quote me on that).

Here is the branch that was merged:

https://code.launchpad.net/~widelands-dev/widelands/terrain_affinity_as_int/+merge/358299

As far as I can recall, one major problem was that most values were handled as floating-point numbers. She converted many of them to integers, since floating-point calculations can differ depending on platform and compiler.
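To illustrate the general idea only (the merged branch's actual formulas are not quoted here): the hypothetical `affinity_permille` below computes a 0 to 1000 score using integer arithmetic alone. Integer results are exactly specified by C++, whereas floating-point results can vary with factors such as x87 extended precision, FMA contraction or optimization flags, and even a one-bit difference can eventually desync a game.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical score in per mille: 1000 when temperature == preferred,
// falling off linearly to 0 at a distance of `tolerance`.
std::int32_t affinity_permille(std::int32_t temperature, std::int32_t preferred,
                               std::int32_t tolerance) {
    assert(tolerance > 0);
    const std::int32_t diff = temperature > preferred ? temperature - preferred
                                                      : preferred - temperature;
    if (diff >= tolerance) {
        return 0;
    }
    // Scale by 1000 before dividing so the result stays an exact integer;
    // integer division truncates identically on every platform.
    return (tolerance - diff) * 1000 / tolerance;
}
```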

Maybe there are some floating-point operations left that will have different results and lead to a desync? The desync with the ships was definitely solved with the aforementioned branch.

kaputtnik (franku)
Changed in widelands:
milestone: none → build20-rc1
_aD (ad-simplypeachy) wrote :

We have only multiplayed on maps with seafaring.

Notabilis (notabilis27) wrote :

I think the problems in the other bug report(s) are unrelated to this bug. Of course, there may still be other floating-point numbers around somewhere. Regarding this bug (or at least its savegame), though, I think it is AI-related: both replacing the AI with a human player and using my branch (where the AI no longer uses logic_rand()) allow the savegame to be continued.

Hm, okay, thanks. Maybe the AI didn't use ships anyway, or you just got lucky with the random numbers. Or I may simply be wrong.

_aD (ad-simplypeachy) wrote :

Notabilis, using your branch the save plays for many in-game hours without desync. \o/

kaputtnik (franku)
Changed in widelands:
status: New → Fix Committed
_aD (ad-simplypeachy) wrote :

I think this fix may have introduced a bug when playing a (singleplayer) save from a previous version: jumping to areas sometimes crashes the game. I've been able to reproduce this reliably when jumping from some specific messages in the inbox and from the seafaring statistics.

However, I haven't been able to reproduce this with a newly-created game using the fixed version, so it may not be a big problem.

kaputtnik (franku) wrote :

From what I remember, we have had several savegame incompatibilities in the current trunk. So if this can't be reproduced with the fixed version, it'll be OK, IMHO.

_aD (ad-simplypeachy) wrote :

kaputtnik - agreed.

I'm afraid we're still having problems. We started a new game, both using the fixed version: I was using an AppVeyor build of notabilis' pre-merge fix, and alasia was using the merged version. I presume this won't make a difference.

In the attached file, I (simplypeachy) hosted first and alasia hosted the second time.

_aD (ad-simplypeachy) wrote :

Apologies for the spam, but we can confirm there are desyncs on maps with no seafaring (Elven Crossing).

GunChleoc (gunchleoc) wrote :

Fixed in build20-rc1

Changed in widelands:
status: Fix Committed → Fix Released