PHPNW12: Don’t reboot, debug!

Some notes from Joshua Thijssen’s talk, which focussed on the knowledge and methods needed to deal with problems in production in a measured way. I was at the back of the room, and some of this is outside my wheelhouse, but here’s the notes; just bear in mind I might have either misheard or misunderstood stuff and feel free to correct me in the comments.

Deal with the problem now. Don’t reboot. Don’t reboot your system every night on a Cron job to solve a slowdown! Listen to your problems, sort them out and save yourself some future pain.

If you’re rebooting every night to solve some system slowdown and your visitors suddenly increase by 200%, then you’re rebooting at night AND at lunchtime. Not good. Not sustainable.

PHPNW EXCLUSIVE: Joshua’s bottleneck troubleshooting flowchart: Site is slow or not responding? It’s your database. 99% of the time, it’s your database.

Show of hands. How many people are using MySQL? Why? MySQL is not easy to configure; too many inter-related configuration options. Lots of ways to trip yourself up.
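(Not from the talk, but a couple of commands I find handy for a first look at what MySQL is actually up to, assuming you’ve got credentials for a MySQL shell on the box:)

    mysql -e "SHOW FULL PROCESSLIST"                     # what's running right now?
    mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries'"    # how many slow queries so far?
    # Switch the slow query log on at runtime so offenders end up in a
    # file you can read through later:
    mysql -e "SET GLOBAL slow_query_log = 1"
    mysql -e "SET GLOBAL long_query_time = 1"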

Not the database? Look at Apache and PHP. 99% of the time it’s PHP, but remember PHP sits inside Apache.

Backups can require a lot of resources. If the system is at capacity, then the automatic backup run can take the whole thing down.
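(One way to soften that, my sketch rather than Joshua’s: run the backup at the lowest CPU and IO priority so it yields to real traffic. The paths and database name below are made up.)

    # Nightly dump at "idle" IO priority and lowest CPU priority:
    0 3 * * * ionice -c3 nice -n 19 mysqldump --single-transaction mydb | gzip > /backup/mydb.sql.gz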

Cron jobs can mount up: consider a Cron job run every five minutes which takes ten minutes to complete; over the day, several copies end up running simultaneously.
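(A simple guard against that pile-up, my example rather than one from the talk: wrap the job in flock so a new run gives up immediately if the previous one is still going. The script path is hypothetical.)

    # -n: don't wait for the lock, just skip this run if it's already held
    */5 * * * * flock -n /tmp/myjob.lock /usr/local/bin/myjob.sh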

Linux 101

Caveat: this section is probably where my notes get sketchy…

Use htop rather than top. Why? It has one extra letter AND the output looks nicer, with better information about each process. The states a process can be in:

  • Runnable
  • Interruptible sleep: waiting for something (an alarm, an event, a timer)
  • Uninterruptible sleep: waiting for something it can’t be interrupted from, normally only seen on systems doing heavy IO; lots of these means IO is blocked
  • Stopped: probably someone is debugging it
  • Zombie: defunct
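(If you want to eyeball those states without htop, plain ps shows them too: S is sleeping, D uninterruptible, T stopped, Z zombie.)

    # Count processes by state:
    ps -eo stat,comm | awk 'NR>1 {print substr($1,1,1)}' | sort | uniq -c
    # Or list just the ones stuck in uninterruptible sleep:
    ps -eo pid,stat,wchan,comm | awk '$2 ~ /^D/'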

Back to zombies: they sound bad, but they’re usually down to bad programming and/or bad administration. If there are just a few, don’t worry; it’s probably not worth rebooting to get rid of them.

Memory

A process on a 32-bit Linux system can use 4GB of memory. If that isn’t all available in RAM then disk gets used too, with things moved between memory and disk (swapping) as they’re required.
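(A quick way to see whether a box is actually swapping, for what it’s worth:)

    swapon -s      # which swap devices exist and how much of them is used
    vmstat 1 5     # watch the si/so columns: consistently non-zero means active swapping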

Processes can allocate memory they aren’t using right now; for example, Varnish will take a certain amount of memory and reserve it even if it’s not being used yet.

  • Virtual memory: allocated not used
  • Resident memory: allocated and used
  • Shared memory: shared objects
  • Swapped memory: memory on disk

(Did I get that right?)
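(They map roughly onto the VIRT, RES and SHR columns in top/htop. For a single process you can pull the same numbers out of ps or /proc; the PID below is made up.)

    # Virtual (VSZ) and resident (RSS) size per process in KB, fattest first:
    ps -eo pid,vsz,rss,comm --sort=-rss | head
    # Full breakdown for one process (VmSize, VmRSS, VmSwap...):
    grep ^Vm /proc/1234/status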

It’s really hard to work out how much memory a process is actually using. For example: all the Apache processes might look like they’re using X each, but a lot of that is shared with their parent, so it’s not entirely true.

Perhaps a better question is “how much free memory does the system have?” Look to the memory used without buffers and cache as a proportion of total memory… note that buffers and cache will take up the remaining available resources. HOWEVER ensure that you allow enough remainder, as buffers and cache are needed to keep the system running smoothly.
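(In practice that’s a one-liner with free:)

    free -m
    # On older versions of free, read the "-/+ buffers/cache" line: its free
    # column is what's genuinely available once buffers and cache are given back.
    # Newer versions roll this into an "available" column instead.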

Monitoring

If in doubt? Monitor everything. Monitor your monitors even, and use proactive monitoring AND alerting. Munin is good too, for working out what’s going on or what has gone on.

In addition to monitoring, log everything. Don’t stop logging because it’s slowing your system down… instead get more resources and keep logging.

Logstash and Graylog are both good tools to look into.

Most used system tool: tail, because it’s quick and easy. Also look into vmstat for memory information and iostat for IO; it’s really easy to diagnose problems with these. And don’t forget the /proc filesystem, which contains all kinds of information about what’s going on.
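(Typical invocations, my cheatsheet rather than Joshua’s slides; the log path will vary per system:)

    tail -f /var/log/apache2/error.log   # follow a log as it grows
    vmstat 1                             # memory, swap in/out (si/so) and run queue, once a second
    iostat -x 1                          # per-device IO utilisation, once a second
    cat /proc/loadavg                    # load averages straight from the kernel
    cat /proc/meminfo                    # the raw numbers behind free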

Joshua also talked a lot about tcpdump, netdump, strace, dtrace, systemtap, which all look like things worth dipping into at a later date.
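(For the record, the sort of thing two of those get used for; the PID and interface here are placeholders:)

    strace -p 1234 -f -e trace=network    # attach to a running process and watch its network syscalls
    tcpdump -i eth0 -nn port 3306         # watch the raw MySQL traffic going over eth0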

MySQL proxy: what is being sent, what isn’t. One to look into.

Before you go live: think about your app and infrastructure. Is everything in place? Preparation makes perfect.

Design for horizontal scalability. Vertical scaling (more memory, more CPUs in the same box) is easier, but also more restrictive.

Reduce or eliminate single points of failure. Not one server, many.

Don’t run at maximum capacity. Scale to avoid this. Leave something for when you get peaks or system hogs.

Make a plan for if something goes wrong. Planning during the emergency is really hard and not fun.

One machine for one purpose: one DB server, one web server, one email server, etc. VMs are easy to set up. VMs vs real machines: it’s nice to keep them simpler, with separated configs, but bear in mind they’re all contending for the host system’s resources.

Try to avoid doing things synchronously; for example, the email doesn’t need to be sent while the visitor waits.
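(A crude sketch of that idea, mine rather than the talk’s: the request just drops the message into a spool and returns, and a cron job drains the spool out of band. The paths and script are hypothetical.)

    # In the request: queue it and carry on
    echo "user@example.com Welcome aboard" >> /var/spool/myapp/outbox
    # In cron: actually send whatever's queued, once a minute
    * * * * * /usr/local/bin/flush-outbox.sh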

If you hit a production problem: Don’t panic. Don’t reboot. Debug. Keep calm. Think. Analyse. Isolate the issue and solve it.

Know your environment. I think this was Joshua’s main message, to my ears anyway.
