continuent.org JIRA  History | Log In     View a printable version of the current page. Get help!  
Issue Details (XML | Word)

Key: TMAN-80
Type: Bug Bug
Status: Open Open
Priority: Major Major
Assignee: Linas Virbalas
Reporter: Edward Archibald
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Tungsten Manager

Manager does not see the correct state of replicators and routers/datasources

Created: 19/Oct/09 12:09 PM   Updated: 21/Oct/09 08:45 AM
Component/s: None
Affects Version/s: None
Fix Version/s: None

Original Estimate: Unknown Remaining Estimate: Unknown Time Spent: Unknown
File Attachments: 1. XML File sample.readonly.xml (0.9 kb)
2. XML File sample.readwrite.xml (0.9 kb)


Sub-Tasks  All   Open   
 Sub-Task Progress: 
1. DATASERVER(state=ONLINE) even when no Monitor running Sub-task Sub-task Open Open Edward Archibald  
2. GC issue causes DS to be visible but REPLICATOR(UNKNOWN) and progress=UNKNOWN Sub-task Sub-task Open Open Linas Virbalas  
3. DS welcome sometimes puts it to online after which other DS dissapears Sub-task Sub-task Open Open Linas Virbalas  

 Description   
I have seen the case, many times, that if you have the connector running first on a node, and then start up all managers at the same time, that often the managers do not see the datasources or do not see the state of the replicators etc. This is the main test - to see if you can reproduce this. I would start with all datasource properties files deleted in the 'datasources' directory on all nodes and then start up everything. I usually see the problem then.

I also often see a similar case that if I change the state of a datasource from the manager, when the connector is running, the state does not change on the router, nor is it reported back, correctly, to the manager.

I attached two files. Just replace the ones in <tungsten-root>/bristlecone/config/evaluator with these. They run the tests against the connector.

Then start cluster-home/bin evaluator_readwrite and then, after it's running, evaluator_readonly

Try cycling the datashave seen the case, many times, that if you have the connector running first on a node, and then start up all managers at the same time, that often the managers do not see the datasources or do not see the state of the replicators etc.
ources between online/offline and 'cluster online/offline' and be sure that the router follows suit. Also be sure the manager sees the same state. NOTE: be sure that you have 'set policy manual' in the manager.




 All   Comments   Work Log   Change History      Sort Order:
Comment by Edward Archibald [19/Oct/09 12:13 PM]
This is the main test - to see if you can reproduce this. I would start with all datasource properties files deleted in the 'datasources' directory on all nodes and then start up everything. I usually see the problem then.

Comment by Linas Virbalas [20/Oct/09 06:36 AM]
I'm having a real pain getting till this issue, because after I started Connector and ran evaluator_readwrite with evaluator_readonly through it, my machine keeps getting stranded and master replicator keeps dying:

INFO | jvm 1 | 2009/08/03 02:45:59 | 2009-08-03 02:45:59,469 INFO extractor.mysql.MySQLExtractor Binlog extraction is supported for this MySQL version: 5.1.29-rc-community-log
INFO | jvm 1 | 2009/08/03 02:46:02 | 2009-08-03 02:46:02,744 INFO tungsten.replicator.ReplicatorManager Sent State Change Notification ONLINE:SLAVE -> ONLINE:MASTER
ERROR | wrapper | 2009/08/03 02:47:34 | JVM appears hung: Timed out waiting for signal from JVM.
ERROR | wrapper | 2009/08/03 02:47:35 | JVM did not exit on request, terminated
STATUS | wrapper | 2009/08/03 02:47:35 | JVM received a signal SIGKILL (9).
STATUS | wrapper | 2009/08/03 02:47:40 | Launching a JVM...
INFO | jvm 2 | 2009/08/03 02:47:40 | WrapperManager: Initializing...
INFO | jvm 2 | 2009/08/03 02:47:41 | 2009-08-03 02:47:41,144 INFO tungsten.replicator.ReplicatorManager Starting Replicatior Manager
INFO | jvm 2 | 2009/08/03 02:47:41 | 2009-08-03 02:47:41,186 INFO tungsten.replicator.EventDispatcher Starting event dispatcher
INFO | jvm 2 | 2009/08/03 02:47:41 | 2009-08-03 02:47:41,189 INFO replicator.conf.PropertiesManager Reading static properties file: /opt/tungsten-enterprise-2009-1.2/cluster-home/bin/../../tungsten-replicator/conf/replicator.properties
INFO | jvm 2 | 2009/08/03 02:47:41 | 2009-08-03 02:47:41,400 INFO commons.jmx.JmxManager JMXConnector started at address service:jmx:rmi:///jndi/rmi://centosa.localdomain:10000/replicator
INFO | jvm 2 | 2009/08/03 02:47:41 | 2009-08-03 02:47:41,412 INFO replicator.conf.PropertiesManager Reading static properties file: /opt/tungsten-enterprise-2009-1.2/cluster-home/bin/../../tungsten-replicator/conf/replicator.properties
INFO | jvm 2 | 2009/08/03 02:47:41 | 2009-08-03 02:47:41,418 INFO tungsten.replicator.ReplicatorManager Sent State Change Notification START -> OFFLINE:NORMAL

I have closed both evaluators for a while now, but replicator keeps dying while catching up with MySQL binary log. Also, the load generated by evaluators was very unhealthy: a few spikes and than zero... a few spikes... zero... more zero than spikes.

Comment by Linas Virbalas [20/Oct/09 01:56 PM]
These were Replicator issues now logged TREP-369 and TREP-370. I'm switching to statement replication to avoid these and to concentrate on this TMAN-80 specifically.

Comment by Linas Virbalas [20/Oct/09 02:59 PM]
I was able to reproduce problem even on only one node and even without the evaluator running. Please find the details bellow.

*** Abstract ***

In short the problem is that:
Connector is "caching" the data sources and Manager blindly believes that what Connector tells is true. While it is not always the case.

*** Details ***


* Reproduction scenario 1 *


1) Stop Manager and Connector:

  ./manager stop
  ./connector stop

2) Remove on-disk representation of datasources:

  rm cluster-home/conf/cluster/default/datasource/*

3) Ensure that replicator is OFFLINE:

  [root@centosa bin]# ./trepctl
  State: OFFLINE:NORMAL
  Seqno Range: -1 -> -1

4) Start Connector - it is very important to start Connector first, so it would "cache" what it see - i.e. no data source, because Replicator is OFFLINE.

  [root@centosa bin]# ../../tungsten-connector/bin/connector start
  Starting Tungsten Connector...

5) Enable Replicator - do this before Manager is running so the Manager would not create the data source. Here might lie one of the real causes of problem, that Connector will not create the data source during this step in his own cache:

  [root@centosa bin]# ./trepctl online
  State: ONLINE:MASTER
  Seqno Range: 0 -> 8

6) Start the Manager - it will get data source information from Connector as a reliable source:

  [root@centosa bin]# ./manager start
  Starting Tungsten Manager Service...

7) But it was not reliable in this scenario! Not only Connector does not know and report to the Manager about the data source, the Manager does no "CREATING NEW DataSource 'centosa.localdomain'" by itself either:

  [LOGICAL] /cluster/default/> ls

  COORDINATOR[centosa.localdomain:MANUAL]

  ROUTERS:
  +-----------------------------------------------------------------------+
  |centosa.localdomain:10999:ONLINE |
  +-----------------------------------------------------------------------+

  DATASOURCES:


No on-disk files either:

  [root@centosa datasource]# ls
  [root@centosa datasource]#

8) In order to induce a creation of the datasource we need to reenable replicator:

  [root@centosa bin]# ./trepctl offline
  State: OFFLINE:NORMAL
  Seqno Range: -1 -> -1
  [root@centosa bin]# ./trepctl online
  State: SYNCHRONIZING
  Seqno Range: 0 -> 8

This leads to:

  INFO | jvm 1 | 2009/08/03 07:09:20 | [07:09:20] INFO [com.continuent.tungsten.cluster.manager.policy.EnterprisePolicyManager] - REPLICATOR 'centosa.localdomain' STATE TRANSITION OFFLINE => SYNCHRONIZING
  INFO | jvm 1 | 2009/08/03 07:09:24 | [07:09:24] INFO [com.continuent.tungsten.cluster.manager.policy.EnterprisePolicyManager] - REPLICATOR 'centosa.localdomain' STATE TRANSITION SYNCHRONIZING => ONLINE
  INFO | jvm 1 | 2009/08/03 07:09:24 | [07:09:24] INFO [com.continuent.tungsten.cluster.manager.policy.EnterprisePolicyManager] - COORDINATOR: POLICY MODE=MANUAL
  INFO | jvm 1 | 2009/08/03 07:09:24 | [07:09:24] INFO [com.continuent.tungsten.cluster.manager.policy.EnterprisePolicyManager] - COORDINATOR: CREATING NEW DataSource 'centosa.localdomain'

And this to new data source on-disk and in cctrl:

  [LOGICAL] /cluster/default/> ls

  COORDINATOR[centosa.localdomain:MANUAL]

  ROUTERS:
  +-----------------------------------------------------------------------+
  |centosa.localdomain:10999:ONLINE |
  +-----------------------------------------------------------------------+

  DATASOURCES:
  +-----------------------------------------------------------------------+
  |centosa.localdomain(master:ONLINE, progress=10) |
  +-----------------------------------------------------------------------+
  | REPLICATOR(role=master, state=ONLINE) |
  | DATASERVER(state=ONLINE) |
  +-----------------------------------------------------------------------+


* Reproduction scenario 2 *


1. Ensure Replicator is ONLINE.

2. And Connector is started & seen a good data source (Manager is started too).

3. Stop the Manager, but leave the Connector.

4. rm datasource/*

5. Start the Manager => data source will be visible.

6. Put Replicator OFFLINE => data source updated in cctrl, but no new data source on-disk file as Connector is still there.

7. Stop the Connector => data source will dissapear from the Manager!

Comment by Edward Archibald [20/Oct/09 04:40 PM]
I did this exact sequence of steps for Scenario 1 and although it did take some time for the manage to do so - probably 5-10 seconds on my system - it always created a new datasource. I suspect that since Linas is using VMs that are kind of slow, perhaps the manager is 'skipping' some notifications and that, perhaps, this is causing the manager to miss a report that the replicator and dataserver are both online. These notifications, however, should be sent every <monitoring-interval> seconds or so, by default every 3 seconds, so the manager should ALWAYS eventually create datasources if there is not already a datasource and if the Replicator and DataServer are both online. Here's the rule. Please review for correctness:

rule "AUTOMATICALLY CREATE NEW DATASOURCES"
salience 100
when
currentNotification : ReplicatorNotification(
$replicatorName : resourceName,
$resourceProps : resourceProps,
resourceState == ResourceState.ONLINE)
from entry-point "MONITORING"

dataserverNotification : DataServerNotification(
resourceName == $replicatorName,
resourceState == ResourceState.ONLINE)
from entry-point "MONITORING"

eval(cluster.replicatorGetState($replicatorName) == ResourceState.ONLINE ||
cluster.replicatorGetState($replicatorName) == ResourceState.SYNCHRONIZING)

not (DataSource() from cluster.dataSourceGet($replicatorName))

eval(cluster.isCoordinator())
then
Object[] params = {$replicatorName};
logger.info("COORDINATOR: POLICY MODE=" + policyMgr.getMode());
logger.info(String.format("COORDINATOR: CREATING NEW DataSource '%s'", params));
try
{
TungstenProperties newDs = DataSource.createFromReplicatorStatus($resourceProps);
cluster.dataSourceCreate(newDs);
}
catch(ClusterManagerException e)
{
Object[] err = {$replicatorName, e.toString()};
logger.error(String.format(
"Cannot create new datasource for '%s', error=%s", err), e);
}
statistics.increment("CREATE NEW DATASOURCES");

end

Comment by Linas Virbalas [21/Oct/09 08:45 AM]
FIY, Today I've built from new code and I cannot reproduce 1st scenario also! To be exact, now step (7) _is_ creating data source and saving it on disk.
My previous code was of Yesterday's morning. I wonder might this be related to the most recent changes in the manager code?

The 2nd scenario is still reproducible.