The frustrating RouterOS–WireGuard VPN peering bug

I’ve wanted to move my home Virtual Private Network (VPN) server from a virtual machine onto my physical MikroTik router. I use the VPN to connect back to my home network to reach internal devices and services when I’m out and about. The router runs the RouterOS operating system, which supports WireGuard, a modern VPN protocol. I spent several afternoons and late evenings on it without getting it to work. It turned out that a bug in the MikroTik web configuration interface had caused all my hardship.

The WireGuard protocol is relatively new and is an overall improvement compared to older VPN protocols. However, it’s more difficult to troubleshoot than those older protocols. The remote end of a WireGuard tunnel stays quiet unless you can successfully authenticate against it. This behavior makes it harder to distinguish an authentication error from a head-on collision with a firewall rule or other networking roadblocks. Older VPN protocols would aid troubleshooting efforts by emitting error messages when a client connected to them with incorrect credentials.

WireGuard uses the connectionless User Datagram Protocol (UDP) instead of the connection-oriented Transmission Control Protocol (TCP). This protocol difference reduces the transmission overhead. However, UDP has no connection handshake or acknowledgments of its own, so you also lose troubleshooting information on whether your packets reached the other end or not.

The local WireGuard client has no idea why a connection attempt failed. It doesn’t know whether it managed to reach the remote end of the tunnel and, if it did, whether the authentication attempt failed. The WireGuard software only issued a generic error message: “handshake did not complete after 5 seconds, retrying”.
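To illustrate why silence is so uninformative, here is a minimal Python sketch. It is not WireGuard code. The `probe` function and the silent listener are hypothetical stand-ins for a client and a remote end that drops anything it can’t authenticate. From the client’s side, a listener that ignores the packet looks the same as a firewall that drops it: nothing ever comes back.

```python
import socket


def probe(host: str, port: int, payload: bytes, timeout: float = 0.5) -> str:
    """Send one UDP datagram and report whether anything came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.settimeout(timeout)
        client.sendto(payload, (host, port))
        try:
            client.recvfrom(2048)
            return "reply"
        except socket.timeout:
            # No way to tell a dropped packet from an ignored one.
            return "silence"


# Hypothetical stand-in for a remote end that never answers packets
# it cannot authenticate: bound to a port, but it never replies.
silent_listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
silent_listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener_port = silent_listener.getsockname()[1]

result = probe("127.0.0.1", listener_port, b"not a valid handshake")
silent_listener.close()
print(result)
```

The packet was delivered to a listening socket, yet the client only learns that its wait timed out.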

So, what was causing my connection issues? I’d configured the WireGuard software in my router, set up the necessary routing rules, and opened the required ports in the firewall. I’d also double- and triple-checked that I’d set everything up correctly. Everything seemed fine on both ends of the VPN tunnel, but most of my road warriors couldn’t establish a VPN tunnel. A road warrior refers to any roaming remote client that is expected to connect to a trusted network.

Frustratingly, the third road warrior I had set up worked fine. It connected, authenticated, routed, and performed as expected when connecting to the VPN tunnel, both from inside and outside the local network. No other road warrior could connect in, however.

I failed to identify any differences between the road warriors that worked and those that didn’t. Two were even running the same version of the WireGuard app, on the same version of Android, on identical devices. Yet, only one of them would connect.

I wasted hours trying to troubleshoot the problem. I didn’t have much to go on as the RouterOS front-end to WireGuard doesn’t log anything either. Eventually, I stumbled upon the core issue.

The RouterOS web configuration (webfig) interface didn’t save my peer configuration properly! Whenever I added a new road warrior, I assigned it an IP address and saved its public cryptographic key in webfig. As road warriors, they aren’t expected to have fixed IP addresses, so I left the endpoint address field empty in webfig.

The empty fields incorrectly got saved as an empty string instead of no value! No value (null) and an empty string ("") mean very different things to software, even though humans consider both to be nothing. The underlying configuration read endpoint-address="". This behavior caused WireGuard to refuse connections from peers with a trusted key because their source IP addresses never matched the expected empty address.
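The difference is easier to see in code. The following Python sketch is only a model of the behavior described above, not how RouterOS or WireGuard implement it. The `endpoint_allows` function is hypothetical, and the address is a documentation-range placeholder.

```python
from typing import Optional


def endpoint_allows(configured: Optional[str], source_ip: str) -> bool:
    """Hypothetical check: does a peer's source address pass the endpoint setting?"""
    if configured is None:
        # No value: no endpoint is pinned, so a roaming peer may
        # connect from any address.
        return True
    # Any string, even an empty one, is a value to compare against.
    # No real IP address is ever equal to "".
    return configured == source_ip


roaming_peer = endpoint_allows(None, "203.0.113.7")
broken_peer = endpoint_allows("", "203.0.113.7")
print(roaming_peer, broken_peer)
```

With no value, the peer is accepted. With an empty string, the same peer is rejected, even though both settings look identical in a user interface that renders them as a blank field.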

The problem did not affect the MikroTik WinBox software, an alternative to the RouterOS webfig. Neither the WinBox nor webfig user interfaces showed that the configuration contained a value for WireGuard peers’ endpoint addresses. They just appeared empty, so nothing seemed to be wrong.

The final workaround for the bug was to remove the empty configuration attributes from each of my WireGuard peers. I had to use the RouterOS command line interface (CLI) — RouterOS’ third and more verbose configuration option — to get the job done correctly.
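If you want to check a saved copy of a configuration for the same symptom, a small script can look for the tell-tale attribute. This is a rough Python sketch under stated assumptions. Only the endpoint-address="" text comes from my configuration. The sample lines, interface name, and keys below are made-up placeholders, and a real configuration may be laid out differently.

```python
import re
from typing import List

# Matches the empty-string attribute, but not a missing attribute.
EMPTY_ENDPOINT = re.compile(r'\bendpoint-address=""')


def peers_with_empty_endpoint(config_text: str) -> List[str]:
    """Return the configuration lines that carry an empty endpoint address."""
    return [
        line.strip()
        for line in config_text.splitlines()
        if EMPTY_ENDPOINT.search(line)
    ]


# Hypothetical sample input; every value here is a placeholder.
sample = '''
add allowed-address=10.0.0.2/32 endpoint-address="" interface=wg0 public-key="AAAA"
add allowed-address=10.0.0.3/32 interface=wg0 public-key="BBBB"
'''

affected = peers_with_empty_endpoint(sample)
for line in affected:
    print(line)
```

The script only reports affected lines. Removing the attribute still has to be done on the router itself.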

I reported the issue to MikroTik, which fixed the bug a week later in RouterOS version 7.6 (released on 2022-10-17). I’ve confirmed that the problem is fixed, so no other MikroTik customers should have to waste their time on this problem.