This is the main test - to see if you can reproduce this. I would start with all datasource properties files deleted in the 'datasources' directory on all nodes and then start up everything. I usually see the problem then.
I'm having a real pain getting till this issue, because after I started Connector and ran evaluator_readwrite with evaluator_readonly through it, my machine keeps getting stranded and master replicator keeps dying:
INFO | jvm 1 | 2009/08/03 02:45:59 | 2009-08-03 02:45:59,469 INFO extractor.mysql.MySQLExtractor Binlog extraction is supported for this MySQL version: 5.1.29-rc-community-log
INFO | jvm 1 | 2009/08/03 02:46:02 | 2009-08-03 02:46:02,744 INFO tungsten.replicator.ReplicatorManager Sent State Change Notification ONLINE:SLAVE -> ONLINE:MASTER
ERROR | wrapper | 2009/08/03 02:47:34 | JVM appears hung: Timed out waiting for signal from JVM.
ERROR | wrapper | 2009/08/03 02:47:35 | JVM did not exit on request, terminated
STATUS | wrapper | 2009/08/03 02:47:35 | JVM received a signal SIGKILL (9).
STATUS | wrapper | 2009/08/03 02:47:40 | Launching a JVM...
INFO | jvm 2 | 2009/08/03 02:47:40 | WrapperManager: Initializing...
INFO | jvm 2 | 2009/08/03 02:47:41 | 2009-08-03 02:47:41,144 INFO tungsten.replicator.ReplicatorManager Starting Replicatior Manager
INFO | jvm 2 | 2009/08/03 02:47:41 | 2009-08-03 02:47:41,186 INFO tungsten.replicator.EventDispatcher Starting event dispatcher
INFO | jvm 2 | 2009/08/03 02:47:41 | 2009-08-03 02:47:41,189 INFO replicator.conf.PropertiesManager Reading static properties file: /opt/tungsten-enterprise-2009-1.2/cluster-home/bin/../../tungsten-replicator/conf/replicator.properties
INFO | jvm 2 | 2009/08/03 02:47:41 | 2009-08-03 02:47:41,400 INFO commons.jmx.JmxManager JMXConnector started at address service:jmx:
rmi:///jndi/rmi://centosa.localdomain:10000/replicator
INFO | jvm 2 | 2009/08/03 02:47:41 | 2009-08-03 02:47:41,412 INFO replicator.conf.PropertiesManager Reading static properties file: /opt/tungsten-enterprise-2009-1.2/cluster-home/bin/../../tungsten-replicator/conf/replicator.properties
INFO | jvm 2 | 2009/08/03 02:47:41 | 2009-08-03 02:47:41,418 INFO tungsten.replicator.ReplicatorManager Sent State Change Notification START -> OFFLINE:NORMAL
I have closed both evaluators for a while now, but replicator keeps dying while catching up with MySQL binary log. Also, the load generated by evaluators was very unhealthy: a few spikes and than zero... a few spikes... zero... more zero than spikes.
I was able to reproduce problem even on only one node and even without the evaluator running. Please find the details bellow.
*** Abstract ***
In short the problem is that:
Connector is "caching" the data sources and Manager blindly believes that what Connector tells is true. While it is not always the case.
*** Details ***
* Reproduction scenario 1 *
1) Stop Manager and Connector:
./manager stop
./connector stop
2) Remove on-disk representation of datasources:
rm cluster-home/conf/cluster/default/datasource/*
3) Ensure that replicator is OFFLINE:
[
root@centosa bin]# ./trepctl
State: OFFLINE:NORMAL
Seqno Range: -1 -> -1
4) Start Connector - it is very important to start Connector first, so it would "cache" what it see - i.e. no data source, because Replicator is OFFLINE.
[
root@centosa bin]# ../../tungsten-connector/bin/connector start
Starting Tungsten Connector...
5) Enable Replicator - do this before Manager is running so the Manager would not create the data source. Here might lie one of the real causes of problem, that Connector will not create the data source during this step in his own cache:
[
root@centosa bin]# ./trepctl online
State: ONLINE:MASTER
Seqno Range: 0 -> 8
6) Start the Manager - it will get data source information from Connector as a reliable source:
[
root@centosa bin]# ./manager start
Starting Tungsten Manager Service...
7) But it was not reliable in this scenario! Not only Connector does not know and report to the Manager about the data source, the Manager does no "CREATING NEW DataSource 'centosa.localdomain'" by itself either:
[LOGICAL] /cluster/default/> ls
COORDINATOR[centosa.localdomain:MANUAL]
ROUTERS:
+-----------------------------------------------------------------------+
|centosa.localdomain:10999:ONLINE |
+-----------------------------------------------------------------------+
DATASOURCES:
No on-disk files either:
[
root@centosa datasource]# ls
[
root@centosa datasource]#
8) In order to induce a creation of the datasource we need to reenable replicator:
[
root@centosa bin]# ./trepctl offline
State: OFFLINE:NORMAL
Seqno Range: -1 -> -1
[
root@centosa bin]# ./trepctl online
State: SYNCHRONIZING
Seqno Range: 0 -> 8
This leads to:
INFO | jvm 1 | 2009/08/03 07:09:20 | [07:09:20] INFO [com.continuent.tungsten.cluster.manager.policy.EnterprisePolicyManager] - REPLICATOR 'centosa.localdomain' STATE TRANSITION OFFLINE => SYNCHRONIZING
INFO | jvm 1 | 2009/08/03 07:09:24 | [07:09:24] INFO [com.continuent.tungsten.cluster.manager.policy.EnterprisePolicyManager] - REPLICATOR 'centosa.localdomain' STATE TRANSITION SYNCHRONIZING => ONLINE
INFO | jvm 1 | 2009/08/03 07:09:24 | [07:09:24] INFO [com.continuent.tungsten.cluster.manager.policy.EnterprisePolicyManager] - COORDINATOR: POLICY MODE=MANUAL
INFO | jvm 1 | 2009/08/03 07:09:24 | [07:09:24] INFO [com.continuent.tungsten.cluster.manager.policy.EnterprisePolicyManager] - COORDINATOR: CREATING NEW DataSource 'centosa.localdomain'
And this to new data source on-disk and in cctrl:
[LOGICAL] /cluster/default/> ls
COORDINATOR[centosa.localdomain:MANUAL]
ROUTERS:
+-----------------------------------------------------------------------+
|centosa.localdomain:10999:ONLINE |
+-----------------------------------------------------------------------+
DATASOURCES:
+-----------------------------------------------------------------------+
|centosa.localdomain(master:ONLINE, progress=10) |
+-----------------------------------------------------------------------+
| REPLICATOR(role=master, state=ONLINE) |
| DATASERVER(state=ONLINE) |
+-----------------------------------------------------------------------+
* Reproduction scenario 2 *
1. Ensure Replicator is ONLINE.
2. And Connector is started & seen a good data source (Manager is started too).
3. Stop the Manager, but leave the Connector.
4. rm datasource/*
5. Start the Manager => data source will be visible.
6. Put Replicator OFFLINE => data source updated in cctrl, but no new data source on-disk file as Connector is still there.
7. Stop the Connector => data source will dissapear from the Manager!
I did this exact sequence of steps for Scenario 1 and although it did take some time for the manage to do so - probably 5-10 seconds on my system - it always created a new datasource. I suspect that since Linas is using VMs that are kind of slow, perhaps the manager is 'skipping' some notifications and that, perhaps, this is causing the manager to miss a report that the replicator and dataserver are both online. These notifications, however, should be sent every <monitoring-interval> seconds or so, by default every 3 seconds, so the manager should ALWAYS eventually create datasources if there is not already a datasource and if the Replicator and DataServer are both online. Here's the rule. Please review for correctness:
rule "AUTOMATICALLY CREATE NEW DATASOURCES"
salience 100
when
currentNotification : ReplicatorNotification(
$replicatorName : resourceName,
$resourceProps : resourceProps,
resourceState == ResourceState.ONLINE)
from entry-point "MONITORING"
dataserverNotification : DataServerNotification(
resourceName == $replicatorName,
resourceState == ResourceState.ONLINE)
from entry-point "MONITORING"
eval(cluster.replicatorGetState($replicatorName) == ResourceState.ONLINE ||
cluster.replicatorGetState($replicatorName) == ResourceState.SYNCHRONIZING)
not (DataSource() from cluster.dataSourceGet($replicatorName))
eval(cluster.isCoordinator())
then
Object[] params = {$replicatorName};
logger.info("COORDINATOR: POLICY MODE=" + policyMgr.getMode());
logger.info(String.format("COORDINATOR: CREATING NEW DataSource '%s'", params));
try
{
TungstenProperties newDs = DataSource.createFromReplicatorStatus($resourceProps);
cluster.dataSourceCreate(newDs);
}
catch(ClusterManagerException e)
{
Object[] err = {$replicatorName, e.toString()};
logger.error(String.format(
"Cannot create new datasource for '%s', error=%s", err), e);
}
statistics.increment("CREATE NEW DATASOURCES");
end
FIY, Today I've built from new code and I cannot reproduce 1st scenario also! To be exact, now step (7) _is_ creating data source and saving it on disk.
My previous code was of Yesterday's morning. I wonder might this be related to the most recent changes in the manager code?
The 2nd scenario is still reproducible.