MySQL: Slave lag behind master and data corruption.

I am reaching out to the MySQL experts out there! We just inherited a site which uses MySQL master/slave replication. We set up the master and slaves per the MySQL documentation and did some of our own tweaking in my.cnf to get things working. Things were great while we were testing the site, but then after we put it up, we started to notice missing data on the slaves. At first we thought maybe we had missed something with the import of data on the slaves, so we re-imported. All tables in our db at that time were MyISAM. After a day or so, we started to see data corruption on some of the slaves. We went through and repaired all the tables. A few hours after the repair, we checked row counts and the repaired tables didn't have the same counts as the master did. Eventually most of our slaves were in the same state, so we decided to re-import the data from the master again.

Before we went ahead and re-imported, we made a decision to move to InnoDB for *most* of our crucial tables. After altering the tables on the master, we did a dump/import. Converting the tables to InnoDB got rid of our data corruption problem, but the missing-rows issue is still there. Hardware on the master is 4×1.86 GHz with 8 GB of RAM; the slaves are 4×1.86 GHz with 4 GB of RAM.
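
For reference, the conversion itself was just a per-table engine change, roughly like this (the table name is a placeholder):

ALTER TABLE some_big_table ENGINE=InnoDB;        -- hypothetical table name; repeated for each crucial table
SHOW TABLE STATUS LIKE 'some_big_table'\G        -- sanity check that Engine now reads InnoDB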

Here is the config running on our master:
< part of the config has been cut out to keep this short, simple, and to the point >
skip-locking
ft_min_word_len=2
long_query_time=30
max_connect_errors=10000
max_connections=1000
#skip-grant-tables
skip-bdb
delay-key-write=ALL
key_buffer = 1G
read_buffer_size = 2M
read_rnd_buffer_size = 8M
myisam_sort_buffer_size = 64M
thread_concurrency = 8
skip-name-resolve
max_allowed_packet = 32M
#low-priority-updates
table_cache = 1024
thread_cache = 8
query_cache_size = 64M
#
# added by sunny >
wait_timeout = 120
tmp_table_size = 128M
sort_buffer_size = 128M # Sort buffer is used to perform sorts for some ORDER BY and GROUP BY queries.
read_rnd_buffer_size = 8M # Read the rows in sorted order by using the row
# pointers in the result file. To optimize this,
# we read in a big block of row pointers, sort them,
# and use them to read the rows in sorted order into a row buffer.
thread_cache_size = 8
#
# Try number of CPU's*2 for thread_concurrency
thread_concurrency = 16
log_slow_queries
log_long_format
#
# INNODB Settings
innodb_additional_mem_pool_size = 32M # was 16
innodb_buffer_pool_size = 4G
innodb_file_per_table
innodb_log_file_size = 50M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 0
innodb_lock_wait_timeout = 50
innodb_flush_method=O_DIRECT

Here is the config running on our slave:

# The MySQL server
[mysqld]
port = 3306
socket = /tmp/mysql1.sock
pid-file = /usr/local/mysql/var/mysql.pid
relay-log = /usr/local/mysql/var/mysql-relay-bin
relay-log-index = /usr/local/mysql/var/mysql-relay-bin.index
log-error = /usr/local/mysql/var/mysql.err
master-info-file = /usr/local/mysql/var/mysql-master.info
relay-log-info-file = /usr/local/mysql/var/mysql-relay-log.info
datadir = /usr/local/mysql/var
server-id = 31
skip-slave-start
skip-locking
skip-name-resolve
ft_min_word_len=2
long_query_time=30
max_connect_errors=10000
max_connections=1000
#skip-grant-tables
skip-bdb
delay-key-write=ALL
key_buffer = 512M
max_allowed_packet = 1M
table_cache = 512M
myisam_sort_buffer_size = 64M
thread_cache_size = 4
read-only
#slave-skip-errors=all
#
# added by sunny >
wait_timeout = 120
tmp_table_size = 128M
read_buffer_size = 2M
sort_buffer_size = 128M # Sort buffer is used to perform sorts for some ORDER BY and GROUP BY queries.
read_rnd_buffer_size = 8M # Read the rows in sorted order by using the row
# pointers in the result file. To optimize this,
# we read in a big block of row pointers, sort them,
# and use them to read the rows in sorted order into a row buffer.
thread_cache_size = 8
query_cache_size = 16M # changed to 16 3/1/07
#
# Try number of CPU's*2 for thread_concurrency
thread_concurrency = 8
log_slow_queries
log_long_format
#
# INNODB Settings
default_table_type = INNODB
innodb_additional_mem_pool_size = 32M # was 16
innodb_buffer_pool_size = 2G
innodb_file_per_table
innodb_log_file_size = 50M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 0
innodb_lock_wait_timeout = 50
innodb_flush_method=O_DIRECT

Please feel free to comment on or scrutinize any of these settings. If you think they should be different, please give a quick summary of why. Some of these settings were inherited from the old MySQL setup (where the site was hosted before).

Right now we are just trying to find out why we have missing rows and why our slaves constantly lag behind. We have refreshed the data almost every day for the last two weeks. The load average barely goes above 1 or 2. What can we do to track down the problem? What are the things we should be looking out for?
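
Are there fields in SHOW SLAVE STATUS (or elsewhere) that we should be watching closely? For example:

SHOW SLAVE STATUS\G
-- fields worth watching: Seconds_Behind_Master, Slave_IO_Running,
-- Slave_SQL_Running, Last_Errno and Last_Error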

12 thoughts on “MySQL: Slave lag behind master and data corruption.”

  1. Xaprb

    Your sort_buffer_size is too large; it is allocated for every query that does a filesort. You should keep it smaller and only increase it just before each query that does a filesort:

    set @@session.sort_buffer_size = (bigger value);
    (query)
    set @@session.sort_buffer_size = default;

    You should probably do a little investigative work to see what you really need to tune. Any chance you can benchmark before tuning? It might save you a lot of work. Otherwise try mysqlreport or something for some basic direction on it.

    As far as the slaves being behind and missing data, I don’t see anything suspicious (I was initially thinking you’d have some replication filters not doing what you thought). Any errors in the master or slave error logs?

    You might consider posing your question on the mysql general mailing list. Blog comments are hard to use for a conversation about something like this.

  2. GOTG

    Slaves lag if you run heavy queries on MyISAM tables. MyISAM locks the whole table on insert/update, so you need to use something else for heavy insert/update tables (InnoDB would be the logical choice). Warning: InnoDB tables are about 6 times the size of MyISAM tables.
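
    You can usually see this on a lagging slave, e.g.:

    SHOW FULL PROCESSLIST;
    -- the replication SQL thread (User: system user) will often sit in
    -- State: Locked while a long-running query holds the MyISAM table lock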

  3. ryan

    One thing to check is whether the missing rows are being written to the binlog at all. I've run into a few cases where certain statements are applied on the master but never written to the binlog, so of course they are not relayed to the slaves.

    And if they are in the binlog, it is worth watching the binlog to try to discern a pattern in the missing data. Is it an unusually large statement? Does it use syntax that is unlike your other code? Does it happen at a regular time interval? These will give important clues for figuring out the root of the problem.

  4. CrazyToon Post author

    GOTG: we initially thought it was the MyISAM tables, so we moved to InnoDB. Our total increase in disk usage is about 1.5x. We also checked the server IDs and they are all unique.
    Ryan: how does one go about looking into the binlogs?

    All good ideas. Keep them coming.

  5. dan

    What is the MySQL version? We had a similar problem when we were on 4.0, and it seemed to be fixed when we moved to 4.1.xx.

  6. GOTG

    To check the binary log, use mysqlbinlog. It's unlikely that you have queries bigger than 1M (the slave's packet size), but I see the master has a 32M packet size. If large queries are missing from the slaves, then you may need to increase the packet size on the slaves.
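
    For example, to see what actually made it into the master's binary log around the time rows went missing (the binlog file name and times below are placeholders):

    mysqlbinlog --start-datetime="2007-09-20 10:00:00" \
                --stop-datetime="2007-09-20 11:00:00" \
                /path/to/mysql-bin.000042 | less
    # or grep for statements touching a particular table:
    mysqlbinlog /path/to/mysql-bin.000042 | grep -i "insert into your_table"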

  7. Dmitri Mikhailov

    A few things are missing:

    – MySQL version;
    – Error logs;
    – Slow query log;
    – Repair table output.

    Without this information I can only guess…

    – Replication errors? Set max_allowed_packet to the same size on the master and the slave (in your case it's set to 32M on the master and 1M on the slave).
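
    On the slave that's a one-line change in the [mysqld] section of my.cnf (plus a restart):

    max_allowed_packet = 32M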

  8. Dmitri Mikhailov

    Make sure read_buffer_size is less than or equal to max_allowed_packet; there were a few bugs related to these variables. “Too many open files” or “Can’t open file” errors? Tune up the open_files_limit variable. (A quick way to check all of these variables is sketched at the end of this comment.)

    – Mixed MyISAM/InnoDB config? “Out of memory” errors in the log? Slave crashes with no traces in the logs? Upgrade the memory on the slave or tune up the server variables.

    – REPAIR TABLE corrupts the data/index file? Make sure myisam_repair_threads is set to 1, just to be safe; there were a few bugs regarding “REPAIR TABLE” corrupting the MyISAM data/index files.

    – thread_cache_size is defined twice in the slave config file. This just does not make sense.
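
    A quick way to compare these variables on the master and each slave, from the mysql client:

    SHOW GLOBAL VARIABLES LIKE 'max_allowed_packet';
    SHOW GLOBAL VARIABLES LIKE 'read_buffer_size';
    SHOW GLOBAL VARIABLES LIKE 'open_files_limit';
    SHOW GLOBAL VARIABLES LIKE 'myisam_repair_threads';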

  9. Steve

    I don’t recommend setting up the slave via a dump/import. It’s better to do it by copying the data files.
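
    Roughly, the classic MyISAM-era procedure looks like this (for InnoDB tables you need a clean shutdown or a filesystem snapshot rather than copying files from a running server; the file name and position below are placeholders):

    -- on the master: block writes and record the binlog position
    FLUSH TABLES WITH READ LOCK;
    SHOW MASTER STATUS;   -- note File and Position
    -- copy the data files to the slave (rsync/scp/tar) while the lock is held, then:
    UNLOCK TABLES;

    -- on the slave: point replication at the recorded position and start it
    CHANGE MASTER TO
      MASTER_LOG_FILE='mysql-bin.000042',   -- placeholder, from SHOW MASTER STATUS
      MASTER_LOG_POS=98;                    -- placeholder position
    START SLAVE;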

  10. CrazyToon Post author

    MySQL version 5.0.45, source install.
    We fixed our sort_buffer_size. I didn't know that if we put 128M, it would allocate 128M for each query; for some reason I thought it was a ceiling and a query would only use what it needed, up to 128M. Apparently I was wrong. We have now set sort_buffer_size = 4M.

    I will try the packet size change, and I will also check the logs for the errors Dmitri mentioned.

    Thanks for all of your suggestions/comments. Even if they don't fix the problem, I am sure we are all learning what to do and what not to do. Keep them coming.

  11. Paul M

    I noticed Baron (Xaprb) is too modest: use the MySQL Toolkit to run checksums against the slaves, and you can also use its sync function to sync the missing rows.

    You are going to have to use mysqlbinlog to mine the binlogs and determine what SQL was not replicated.
    If you combine that with the table checksum from the toolkit, you should be able to search for the SQL (on the master) which should have updated the slave.

    Other stuff: review the use of these settings on the master/slave.
    innodb_flush_log_at_trx_commit = 0
    innodb_lock_wait_timeout = 50

    From the MySQL 5.0 doco
    http://dev.mysql.com/doc/refman/5.0/en/innodb-parameters.html

    Note: For the greatest possible durability and consistency in a replication setup using InnoDB with transactions, you should use innodb_flush_log_at_trx_commit=1 and sync_binlog=1.

    Note: Both of these are going to slow down the transactions-per-second performance of the master, so benchmark first, then change them and measure the difference.
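
    In my.cnf on the master that translates to:

    # favour durability/consistency over raw commit speed
    innodb_flush_log_at_trx_commit = 1
    sync_binlog = 1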

    Have Fun
    Paul
    http://blog.dbadojo.com/2007/09/mysqltoolkit-mysql-table-checksum.html
