Day 38 of 60: Multiple queues, one queue runner

Posted by nik on August 16, 2006

Today I’m looking at the results that I’ve obtained from the latest round of tests. These tests used sendmail -q to deliver 30,000 messages to a different zone. There were 10 runs to each test, and the different tests collected data on timings for 1, 5, 10, 20, 30, and 40 queue directories.

Highlevel overview of Sendmail queue processing

Before diving in to the results it’s worth looking at what Sendmail does when it runs the queue like this. To get specific data, I’ve written queue-run.d. This shows, at a fairly high level, what happens when Sendmail processes the queue.

Here’s some sample output that it produced. In this test I had 5 queue directories. 4 of the directories were empty, while the first directory contained 2 messages (so 4 files in total). I’ve removed the timestamps and lock count columns from the output.

This first section of output:

16537 runqueue:queue-run-start
16537 runqueue:queue-run-end
16538 gatherq:queue-read-start
16538 Locked qfk7GD224j016534           'sh, nb'   1
16538 Locked qfk7GD224i016534           'sh, nb'   1
16538 gatherq:queue-read-end

shows that the sendmail -q process that I started was pid 16537. This process started the queue run, and almost immediately finished. However, it forked process 16538, and it’s this process that’s going to deliver the messages.

Recall that there are 5 queue directories. The queue-read-* probes fire at the beginning and end of the section of code that processes queue directories, and we see here one queue directory being processed. 2 qf* files have been found, and a shared (”sh”), nonblocking (”nb”) lock has successfully (”1″) been obtained for both files.

This next section:

16538 gatherq:queue-read-start
16538 gatherq:queue-read-end
16538 gatherq:queue-read-start
16538 gatherq:queue-read-end
16538 gatherq:queue-read-start
16538 gatherq:queue-read-end
16538 gatherq:queue-read-start
16538 gatherq:queue-read-end

shows another four queue-read-* pairs, one for each of the remaining four queue directories. There are no queue files in these directories, so there’s no work to be done.

The queue runner now sorts the message it has found, to determine the order in which to deliver them.

16538 sortq:queue-sort-start
16538 sortq:queue-sort-end

This has generated a sorted list of work for a queue runner to perform, and one is now started.

16538 runner_work:queue-runner-start
16538 dowork:queue-runjob-start
16538 Locked qdir00/qfk7GD224j016534    'ex, nb'   1
16538 dowork:queue-runjob-end

Each queue entry is a job to be run, and here we see the queue-runjob-* probes firing. As part of the job-running process the qf* file for this job is locked. It’s an exclusive lock this time, instead of a shared lock. The queue runner then connects to the remote system and delivers the message.

This is repeated for the second job in the queue, and with both messages delivered the queue runner ends.

16538 dowork:queue-runjob-start
16538 Locked qdir00/qfk7GD224i016534    'ex, nb'   1
16538 dowork:queue-runjob-end
16538 runner_work:queue-runner-end

Consequences

You can see from this that there are a number of high-level processes happening here.

  • The queues have to be read (queue-read-* probes)
  • The queue entries have to be sorted (queue-sort-* probes)
  • A queue runner is started (queue-runner-* probes)
  • The queue runner runs one or more jobs (queue-runjob-* probes)

The time for all these stages might be affected by the number of queue directories, and this is what queue-run-duration.d reports on.

Result format

View raw data

The results are contained in multiple files, one per set of queue directories. They are CSV files, and look a little like this:

PID,queue-run,queue-read,queue-sort,queue-runner,queue-runjob
15510,0,428899305,6936775,40084251281,39863107039
15521,0,435772123,6547168,40390482231,40184604420
15509,32860952,0,0,0,0
15520,4862161,0,0,0,0

Here, PIDs 15510 and 15521 were forked to do the actual queue run. That’s why that have time in all columns except queue-run. Conversely, PIDs 15509 and 15520 are the parents of these two, in which only the queue-run-* probes fired.

The queue-run times aren’t especially interesting, as that’s carried out by the first process before it forks. It’s the other columns where differences in the number of queue directories might show up.

Results for queue-read

This probe covers the time it takes to iterate through all the files in the queue. The results for different numbers of queue directories are:

x queue-read.01
+ queue-read.05
* queue-read.10
% queue-read.20
- queue-read.30
@ queue-read.40
: = Mean
M = Median
+----------------------------------------------------------+
|%-+%-%-%@%@@%@-*--*+-@ @  @@@@  x  xx xx xx     x        x|
|                               |______M:________|         |
|    |_____:_____|                                         |
|         |____M:____|                                     |
|  |____M:_____|                                           |
|    |____M:______|                                        |
|           |________:_M_____|                             |
+----------------------------------------------------------+

That certainly looks as though there’s a benefit to multiple queue directories. The range bars clearly show it somewhat to the right of the other ranges, and therefore taking longer.

The statistical results bear this out (nanoseconds have been converted to microseconds) in these figures.

N Min Max Median Mean Stddev
x 10 428.89931 464.33992 443.42052 444.47453 9.7989508
+ 10 402.45654 421.85465 411.10078 411.51274 6.8551837
Difference at 99.5% confidence
-32.9618 +/- 13.652
-7.4159% +/- 3.07148%
* 10 405.4721 425.66973 416.3516 416.6921 6.0080299
Difference at 99.5% confidence
-27.7824 +/- 13.1215
-6.25062% +/- 2.95215%
% 10 400.15261 424.28187 407.71897 409.16544 6.7259599
Difference at 99.5% confidence
-35.3091 +/- 13.5679
-7.94401% +/- 3.05258%
- 10 401.18106 423.11366 410.45674 411.72426 7.1450633
Difference at 99.5% confidence
-32.7503 +/- 13.8443
-7.36831% +/- 3.11476%
@ 10 409.59177 432.69872 425.02406 422.27599 9.1898042
Difference at 99.5% confidence
-22.1985 +/- 15.336
-4.99433% +/- 3.45037%

It would seem that, at least until 30 queue directories, there is a 6-7% reduction in the amount of time it takes to read the queue.

Results for queue-sort

This probe covers the time it takes to sort the queue items once they have been gathered. These results use the default Sendmail queue sort order, “Priority”. The results for different numbers of queue directories are:

x queue-sort.01
+ queue-sort.05
* queue-sort.10
% queue-sort.20
- queue-sort.30
@ queue-sort.40
: = Mean
M = Median
+----------------------------------------------------------+
|        *x          -     @@++  %                         |
|x x x   ***-*   - %%- x  -@@-+-+%@ @ *  *@-  @ %-%+    %@@|
|   |___________M_:_____________|                          |
|                         |___M___:_______|                |
|      |____M_______:_____________|                        |
|                      |_________M__:_____________|        |
|                |_________M_:___________|                 |
|                         |________M__:___________|        |
+----------------------------------------------------------+

There’s no apparent difference between the results on this plot, and the statistical summary from ministat bears this out.

N Min Max Median Mean Stddev
x 10 6.492041 7.044234 6.721011 6.7525311 0.2109344
+ 10 6.897024 7.236107 6.9306245 6.9814723 0.118952
No difference proven at 99.5% confidence
* 10 6.607956 7.097123 6.6564425 6.7786936 0.20168636
No difference proven at 99.5% confidence
% 10 6.765709 7.319717 6.9684315 7.0233506 0.19899698
No difference proven at 99.5% confidence
- 10 6.652388 7.212086 6.885209 6.913707 0.18356812
No difference proven at 99.5% confidence
@ 10 6.882107 7.346733 7.00444 7.0520978 0.17935072
No difference proven at 99.5% confidence

Results for queue-runner

This probe covers the time it takes to deliver all the messages in the queue. The results for different numbers of queue directories are:

x queue-runner.01
+ queue-runner.05
* queue-runner.10
% queue-runner.20
- queue-runner.30
@ queue-runner.40
: = Mean
M = Median
+----------------------------------------------------------+
|   - -  %@  %** * ++++      xxx          -       @        |
|   @@-%%-@@@@--** *+++    xxxxxx   -     -       @       @|
|                           |_:|                           |
|                  |:|                                     |
|            |_:_|                                         |
|      |__:___|                                            |
| |________M______:_______________|                        |
||__________M_________:_____________________|              |
+----------------------------------------------------------+

Now these look quite interesting. There’s a clear benefit to be had as queue directories are added, as can be seen by the gradual leftward shift of the range bars. Note, too, that the standard deviation slowly increases as the number of queue directories increases.

This continues until we get to 30 queue directories, and suddenly the range of values increases dramatically. This pattern holds for the 40 queue directory results too. This would seem to indicate that there’s a factor that’s negligible with 20 or fewer queue directories that suddenly becomes much more significant.

The statistical results show this in more detail.

N Min Max Median Mean Stddev
x 10 39546.617 40595.753 40197.703 40163.428 338.53016
+ 10 37840.309 38448.061 38164.578 38170.287 209.78723
Difference at 99.5% confidence
-1993.14 +/- 454.65
-4.96258% +/- 1.132%
* 10 36540.079 37843.604 36954.964 37055.675 430.47168
Difference at 99.5% confidence
-3107.75 +/- 625.175
-7.73777% +/- 1.55658%
% 10 35043.263 36947.376 35942.349 35993.725 697.82562
Difference at 99.5% confidence
-4169.7 +/- 885.417
-10.3818% +/- 2.20454%
- 10 34620.743 42961.967 36161.556 37588.728 3451.5012
No difference proven at 99.5% confidence
@ 10 34589.025 46344.242 36245.368 38587.428 4649.2078
No difference proven at 99.5% confidence

5, 10, or 20 queue directories makes the delivery process take 5%, 8%, and 10% faster respectively. But there’s no significant benefit with 30 or 40 queue directories. Note that these results do not mean that 30 or 40 queue directories have the same benefit as 20 — they mean that 30 or 40 queue directories offer no benefit over 1 queue directory.

Interesting. I wonder why.

Results for queue-runjob

This probe covers the time it takes to deliver all the messages in the queue. This is slightly different from the queue-runner probe. In this results, queue-runjob probes will have fired for every message delivery. The times taken by these probes have been summed.

The queue-runner probe only fires at the start and end of the whole process. So the totall cumulative time for all the queue-runjob probes is going to be less than that for queue-runner. These results may or may not show the same interesting features at 30 and 40 queue directores as queue-runner.

The results for different numbers of queue directories are:

x queue-runjob.01
+ queue-runjob.05
* queue-runjob.10
% queue-runjob.20
- queue-runjob.30
@ queue-runjob.40
: = Mean
M = Median
+----------------------------------------------------------+
|   - -  %@  %** * ++++      xxx          -                |
|   @@-%%-@@@@--** *+++    xxxxx    -     -       @@      @|
|                           |_:|                           |
|                   :|                                     |
|            |_:_|                                         |
|      |__:___|                                            |
| |________M______:_______________|                        |
||__________M_________:_____________________|              |
+----------------------------------------------------------+

Well, that answers that question. This looks very similar to plot for queue-runner, which would imply that whatever is taking additional time in the 30 and 40 queue directory case is bounded by the queue-runjob probes.

The statistical results are similar to queue-runner too.

N Min Max Median Mean Stddev
x 10 39351.85 40370.584 39973.804 39950.51 330.588
+ 10 37657.682 38238.619 37973.41 37977.145 204.91243
Difference at 99.5% confidence
-1973.37 +/- 444.012
-4.93953% +/- 1.1114%
* 10 36351.686 37631.44 36764.907 36857.071 419.81903
Difference at 99.5% confidence
-3093.44 +/- 610.011
-7.74318% +/- 1.52692%
% 10 34857.73 36736.394 35747.589 35795.814 684.95635
Difference at 99.5% confidence
-4154.7 +/- 868.243
-10.3996% +/- 2.1733%
- 10 34440.561 42751.05 35948.914 37393.308 3445.2905
No difference proven at 99.5% confidence
@ 10 34408.408 46132.48 36035.342 38390.119 4648.62
No difference proven at 99.5% confidence

There are similar percentage improvements in the 5, 10, and 20 queue directory cases, and then 30 and 40 queue directories completely lose that advantage, and become as slow as a single queue directory.

Conclusions

Please read the disclaimer in conjunction with these conclusions.

When you have a single process running the queue then you can expect that multiple queue directories will have the following effects:

  • The time taken to read the queue will be reduced by 6-7%
  • With the default queue sort order of “Priority” there is no significant performance gain or loss when using multiple queues.
  • Additional queue directories make the delivery processes run faster, up to a point. There’s a point somewhere between 20 and 30 queue directories, where additional queue directories is a step backwards, and any benefit from having them is lost.
Trackbacks

Use this link to trackback from your own site.

Comments

Leave a response

  1. Dominic Mitchell Wed, 16 Aug 2006 21:06:14 PDT

    Hmmm, I’m going to have to read more carefully. I was convinced I was looking at nethack screens, not stats, for a moment. :-)

  2. nik Wed, 16 Aug 2006 23:19:32 PDT

    Heh. No amulets here.

  3. […] You can verify this by looking at the results for test 20, which follow the same format as those for test 19. If you run those through ministat you’ll see no statistically significant differences for any of the results. […]

Comments


Close
E-mail It