Missing Subscribes in a cluster set up? #2457

@ioolkos

Issue Description

A cluster setup where certain MQTT clients can connect to 'poisoned' nodes, on which their subscriptions are not added to the RAM tables. The subscriptions are still present on disk, and the client can still publish. If the client connects to another cluster node, it will have its subscriptions.

We can verify that we have the subscriber on disk:

vmq_subscriber_db:read({[], <<"jules">>}).
[{'[email protected]',true,
                      [{[<<"balloon">>],
                        {0,
                         #{no_local => false,rap => false,
                           retain_handling => send_retain}}}]}]

If we reverse-lookup the topic in RAM, we should see the ClientId as follows (or the atom fanout):

ets:lookup(vmq_trie_subs, {[], [<<"balloon">>]}).
[{{[],[<<"balloon">>]},
  {{[],<<"jules">>},
   {0,
    #{no_local => false,rap => false,
      retain_handling => send_retain}}}}]

If we don't have that (i.e., the response is []), the client will be online but will miss messages sent to it, since routing uses the RAM tables.
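To spot the mismatch quickly, the two lookups above can be combined in an attached Erlang shell (vernemq attach). This is a minimal sketch; CheckSub is a hypothetical helper that only uses the calls already shown. A non-empty on_disk result together with an empty in_ram list indicates the poisoned state:

```erlang
%% Minimal diagnostic sketch (hypothetical helper, attached Erlang shell).
%% MP is the mountpoint ([] by default), Topic a word list like [<<"balloon">>].
CheckSub = fun(MP, ClientId, Topic) ->
    #{on_disk => vmq_subscriber_db:read({MP, ClientId}),
      in_ram  => ets:lookup(vmq_trie_subs, {MP, Topic})}
end.
%% Example: CheckSub([], <<"jules">>, [<<"balloon">>]).
```

Note that the in_ram lookup returns entries for every subscriber of that topic, so filter for the ClientId if the topic has more than one subscriber.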

It is currently unclear whether only release 2.1.1 is affected. Historically, missing subscriptions have so far only been seen in cluster setups where non-empty nodes were joined.

There is currently no reproducible series of steps. It is also undetermined whether the affected MQTT clients (always) receive a SUBACK.

Exclusion/narrowing tests

Other than investigating the cluster setup, we could check vmq_reg_trie (the process that adds subscriptions to RAM) for race conditions, gen_server2 effects, etc.

As a test, we could check whether choosing a gen_server based trie shows the same behaviour (see the vernemq.conf settings below; a node restart is required for them to take effect):

default_reg_view = vmq_reg_ordered_trie
reg_views = [vmq_reg_ordered_trie]

For testing, vmq_reg_trie:init_subscriptions(). can be useful to repopulate the RAM tables with subscriptions from disk (it is what VerneMQ does when the node boots).
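A sketch of that repopulation step from an attached Erlang shell, followed by a re-check of the RAM table (names and the example topic are taken from the snippets above):

```erlang
%% Repopulate the RAM tables from disk, then verify the entry is back.
vmq_reg_trie:init_subscriptions(),
ets:lookup(vmq_trie_subs, {[], [<<"balloon">>]}).
%% A non-empty result now suggests the subscription was restored from disk.
```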

To exclude issues with the init_sync method of 2.1.1, we can flag every empty node to skip it with vmq-admin node flag_as_init_synced. The node will then just use normal sync. The command has to be issued before the node joins a cluster.

  • The behaviour of the newly built release 2.1.2 should be compared with 2.1.1, as its vmq_reg_trie module uses gen_server instead of gen_server2.

More info required

  • Check whether remote subscriptions for the ClientId are loaded correctly on any other cluster nodes (check the vmq_trie_remote_subs ETS table). Example: call ets:lookup(vmq_trie_remote_subs, {[],[<<"topic">>, <<"A">>]}). on a remote node.
  • Check whether all nodes agree on which node the ClientId is connected to (vmq-admin session show --client_id="client" --client_id --node).

Additional info

  • The cluster was set up from a previous deployment with a leave/join strategy, but without deleting the swc_dkm directory (alongside swc_meta).
  • Update: with the above point on deleting swc_dkm fixed, the issue can still show up.
  • In addition, a "wrong" swc_dkm could not reproduce the issue in (relatively) small-scale testing.

Next suggested exclusion tests:

  • 2.1.2
  • Issuing vmq-admin node flag_as_init_synced before a node joins a cluster (to skip the init sync procedure)
