Mailer Redistribution

Moving one or a few users

There's a script, /usr/local/etc/movemail, that moves selected users from one deskmail system (aka "Mailer") to another (e.g., from one cg to another, or from a bp to a cg). It is invoked once per user and then once more with DONE:
       movemail [-force|-local] user1 mailerN1
       movemail [-force|-local] user2 mailerN2
              ...
       movemail [-force|-local] usern mailerNn
       movemail DONE
Normally movemail will not move a user who is logged in anywhere in the galaxy. The -local option relaxes this check to the local mailer host only, ignoring the rest of the hosts in the galaxy. The -force option moves users even if the watcher on the local mailer says they're active. For example, you may have just HUP'd their last running imapd, or you phoned them and had them disconnect, and you don't want to wait 15 minutes for them to flush out of the watcher.
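For more than a handful of users, movemail can be driven from a list. This is a hypothetical sketch, not one of the existing tools: the `move_batch` function, the one "user mailer" pair per line input format, and the `MOVEMAIL` override variable are assumptions. It passes an optional -force or -local through to every move and always finishes with movemail DONE so the deferred mail gets flushed.

```shell
#!/bin/sh
# Hypothetical batch driver for movemail (not one of the existing tools).
# Reads "user mailer" pairs, one per line, from standard input.
# MOVEMAIL may be overridden, e.g. with a stub command for a dry run.
MOVEMAIL=${MOVEMAIL:-/usr/local/etc/movemail}

move_batch() {
    # Optional first argument: -force or -local, passed to every move.
    opt=""
    case "$1" in
        -force|-local) opt=$1 ;;
    esac
    # The "|| [ -n ... ]" also handles a last line with no trailing newline.
    while read -r user mailer || [ -n "$user" ]; do
        # Skip blank lines and comment lines in the list.
        case "$user" in
            ''|'#'*) continue ;;
        esac
        # </dev/null keeps movemail from consuming the user list on stdin.
        if [ -n "$opt" ]; then
            $MOVEMAIL "$opt" "$user" "$mailer" </dev/null
        else
            $MOVEMAIL "$user" "$mailer" </dev/null
        fi || echo "move_batch: movemail failed for $user" >&2
    done
    # DONE delivers the deferred mail and removes the dummy .forward files.
    $MOVEMAIL DONE </dev/null
}
```

For example, `move_batch -local < movelist` with a hypothetical movelist file. Setting MOVEMAIL to a stub that only echoes its arguments gives a dry run that shows exactly which movemail commands would be issued.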

The movemail script works by first renaming the user's local mailer directory to whatever_moving. This prevents incoming mail from being delivered; it is simply "deferred" and left in /usr/spool/mqueue. It also gives the user a cryptic error if they try to connect via imapd, but that's acceptable since it isn't expected to happen in the short time the move takes. The script then copies the files to the new mailer and sets the new home directory in the passwd file. After removing the old directory, it creates a dummy directory at the new home path containing a .forward file. For example, if it's moving a user from /bp04 to /bp17/d22/u1235, it will create a /bp17/d22/u1235 directory on bp04 containing a .forward file pointing to user@u.washington.edu. The final movemail with the DONE parameter runs sendmail to deliver all the pending mail and then removes all the dummy .forward files. Unfortunately, we don't handle over-quota mail at this time; maybe that will just go away.
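The move sequence above can be imitated on scratch directories to see which paths exist at each step. This is a hypothetical sketch, not the movemail script itself: the `simulate_move` function and its arguments are assumptions, the two "host" roots are just local directories, and the passwd update is only noted in a comment. One reading of the dummy directory (also an assumption) is that deferred mail later delivered on the old host, where the passwd entry already points at the new home path, finds a .forward there and is sent on to user@u.washington.edu.

```shell
#!/bin/sh
# Hypothetical sketch of movemail's sequence, run against two scratch
# directory trees standing in for the old and new mailer hosts.
# Usage: simulate_move OLD_ROOT NEW_ROOT USER OLD_HOME NEW_HOME
simulate_move() {
    old_root=$1 new_root=$2 user=$3 old_home=$4 new_home=$5
    moving="$old_root${old_home}_moving"

    # 1. Rename the home directory; sendmail then defers new mail for the
    #    user and leaves it in the queue instead of delivering it.
    mv "$old_root$old_home" "$moving" || return 1

    # 2. Copy the files (including dotfiles) to the new mailer.
    mkdir -p "$new_root$new_home" || return 1
    cp -pR "$moving/." "$new_root$new_home/" || return 1

    # 3. The real script now sets the new home directory in the passwd
    #    file; that step is not simulated here.

    # 4. Remove the old (renamed) directory.
    rm -rf "$moving"

    # 5. Create a dummy directory at the NEW home path on the OLD host,
    #    holding a .forward that points at the generic address.
    mkdir -p "$old_root$new_home" || return 1
    echo "$user@u.washington.edu" > "$old_root$new_home/.forward"
}
```

With made-up paths, populating /tmp/old/bp04/u1235 and then running `simulate_move /tmp/old /tmp/new user /bp04/u1235 /bp17/d22/u1235` leaves the mail under /tmp/new/bp17/d22/u1235 and only the dummy .forward under /tmp/old/bp17/d22/u1235.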

Quite often people send mail to a particular mailer instead of to the generic user@u.washington.edu address. This can be because the user told them to, or because the user has user@user.deskmail.washington.edu as their return address. Sendmail automatically expands that to user@mailerNN.u.washington.edu, which is what people end up putting in their address books. There's a script, /ux01/ken/maildist/funkyfor, that accepts a list of users and forwards the mail for these users to the appropriate mailer, but you can't hold these people's hands forever.

Balancing all the mailers

The /ux01/ken/maildist directory contains my tools for redistributing the users amongst the mailers to balance the load. When I'm called upon to do that, I perform the following sequence. Each command is run in a window on the specified system, cd'd into the maildist directory.
  1. The first pass, mdist1, will generate a file called NN/raw that contains a list of all the users and the amount of connect time they've used.
       #seuss02> mdist1 NN
    
    Where 'NN' is a unique two-digit number that allows multiple runs at the same time (on different galaxies). In the past the mdist1 utility also listed the disk usage of each user, but that field took a long time to compute and wasn't needed, since the mailers had plenty of disk space.

  2. The second pass, mdist2, will read the NN/raw file and generate a NN/moves file that lists how many users of each category should be moved betwixt the mailers. Due to the volume of output, use a window at least 160 columns wide.
       #seuss02> mdist2 NN
    
    Mdist2 will generate move directives to redistribute the users from one mailer to another or from one bp to another. It won't try to move users from bps to mailers or vice versa (there's an mdist2b that will do that). The mdist2 utility will categorize each user based on how much connect time they have and how much disk space they're using. Each user will be pigeonholed into one of 25 different user types. The redistribution will try to get the same number of each type of user on each mailer subject to the weighting factors.

    The mailer file describes the weighting of each mailer if they're unevenly resourced. An alternate file can be selected with the -m option on the mdist2 command line if a temporary weighting change is required. Mdist2 also creates a NN/sort file listing the users in random order, for use by the next pass.

  3. The third pass, mdist3a, will assign a new deskmail system to each user by reading the NN/sort and NN/moves files and generating a NN/todo file. This file is again sorted on a random key so that parallel mdist4 processes won't all try to fill the deskmail systems in the same order.
       #seuss02> mdist3a NN
    
  4. The last pass, mdist4, moves users off whichever mailer it is run on.
       #mailerxx> mdist4 NN
       #bpxx>     mdist4 NN
       #epxx>     mdist4 NN
    
    This invokes the movemail script on the appropriate number of users from each category, as specified in the NN/todo file. The command should be run on each of the mailers to achieve the proper effect. If the process does not move everyone who was scheduled to be moved, it creates NN/todo.host. When reinvoked, mdist4 picks up where it left off and retries the users who failed for some reason or another on the previous attempt (such as being logged in). The mloop4 NN command invokes mdist4 repeatedly, with a half-hour pause between runs, until it completes.
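The balancing idea behind mdist2 can be illustrated with a small awk program. Everything here is an assumption for illustration: the `mdist2_sketch` name, both input formats (a raw file with user, current mailer, connect minutes and disk KB per line; a weights file with mailer and weight per line), and the bucket cut-offs. It assumes the 25 user types form a 5 by 5 grid of connect-time and disk-use buckets, then gives each mailer a share of every type proportional to its weight; in this sketch a weight of 0 empties a mailer. The real mdist2 may bucket, weight and round differently.

```shell
#!/bin/sh
# Hypothetical illustration of mdist2-style balancing (not the real tool).
# RAW lines:     user mailer connect_minutes disk_kb
# WEIGHTS lines: mailer weight      (weight 0 empties that mailer)
# Prints "type mailer have=N target=N surplus=N" for every mailer whose
# count of a user type differs from its weighted share.
mdist2_sketch() {
    raw=$1 weights=$2
    awk '
    # Place a value in bucket 0..4 using four ascending cut-offs.
    function bucket(v, t1, t2, t3, t4) {
        if (v < t1) return 0
        if (v < t2) return 1
        if (v < t3) return 2
        if (v < t4) return 3
        return 4
    }
    # First file: per-mailer weights.
    NR == FNR { weight[$1] = $2; wsum += $2; next }
    # Second file: one user per line.
    {
        c = bucket($3 + 0, 10, 60, 300, 1200)        # connect-time bucket
        d = bucket($4 + 0, 1000, 5000, 20000, 50000) # disk-use bucket
        t = c * 5 + d                                # one of 25 user types
        have[t, $2]++
        total[t]++
    }
    END {
        if (wsum <= 0) exit 1
        for (t = 0; t < 25; t++) {
            if (!(t in total)) continue
            for (m in weight) {
                target = int(total[t] * weight[m] / wsum + 0.5)
                n = have[t, m] + 0
                if (n != target)
                    printf "%d %s have=%d target=%d surplus=%d\n", t, m, n, target, n - target
            }
        }
    }' "$weights" "$raw"
}
```

A positive surplus is the number of users of that type to move off a mailer, and a negative one is room to receive them; pairing the two up would give move directives of the kind NN/moves holds. Because each target is rounded independently, the targets for a type may not sum exactly to its total.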

Clearing all the users off one or more mailers

Occasionally we have to retire a particular mailer host, and I'm called upon to clear all the users off of it. The first step in this process is to update the QDF so that no new accounts can be created on that host, i.e., set the "N" flag to zero.

The rest of the process consists of running several passes of the above procedure, with a modified mailers file given to mdist2 via the -m option. That file supports a "move all users off this mailer" flag. The mdist4 pass can run for several hours in this case. When it completes, it lists the users it was unable to move, usually because they were logged in. I usually run the mloop4 process all day long, and the next day I look at the people who still wouldn't move.
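A loop in the spirit of mloop4 might look like the following. It is a hypothetical sketch: the function name, the `MDIST4`, `MLOOP_PAUSE` and `HOST_NAME` overrides, and the assumption that mdist4 leaves a non-empty NN/todo.<hostname> file behind only while users remain are guesses rather than documented behaviour. Output from each pass is appended to NN/<hostname>.log.

```shell
#!/bin/sh
# Hypothetical mloop4-style driver: rerun mdist4 every half hour until
# nothing is left to move on this host.
# Assumes mdist4 leaves a non-empty NN/todo.<host> file while users remain.
mloop4_sketch() {
    nn=$1
    host=${HOST_NAME:-$(hostname -s)}
    mdist=${MDIST4:-mdist4}
    pause=${MLOOP_PAUSE:-1800}   # seconds between passes (half an hour)
    while :; do
        # Append each pass's stdout and stderr to the per-host log.
        $mdist "$nn" >> "$nn/$host.log" 2>&1
        # Done once no todo file (or only an empty one) is left behind.
        [ -s "$nn/todo.$host" ] || break
        sleep "$pause"
    done
}
```

Checking for the leftover todo file, rather than mdist4's exit status, is a design choice of this sketch; if the real mdist4 signals completion differently, the loop condition is the one line to change.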

I redirect standard output and standard error from the mdist4 pass into a per-host log file with the command:

       #cgxx> mdist4 NN > NN/`hostname -s`.log 2>&1

The first refinement is to kill the processes of people who have been logged in continuously since last January. If you find a process that has been running for three months, there's a good chance it's stuck.
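Finding the long-lived processes can be scripted from ps output. A minimal sketch, assuming a POSIX ps that accepts `-o pid=,user=,etime=,comm=` and prints etime as [[dd-]hh:]mm:ss; the function names and the 90-day default (roughly the three months mentioned above) are illustrative. It only lists candidates; deciding what to kill is still a judgment call.

```shell
#!/bin/sh
# Whole days from a ps etime value in [[dd-]hh:]mm:ss format.
etime_days() {
    case "$1" in
        *-*) echo "${1%%-*}" ;;   # "dd-" prefix present
        *)   echo 0 ;;            # under one day
    esac
}

# Read "pid user etime comm" lines on stdin; print those at least
# MIN_DAYS old (default 90, about three months). Lists only, never kills.
stale_filter() {
    min=${1:-90}
    while read -r pid user etime comm; do
        [ -n "$etime" ] || continue
        if [ "$(etime_days "$etime")" -ge "$min" ]; then
            echo "$pid $user $etime $comm"
        fi
    done
}

# Example (uncomment to run on a live system):
# ps -eo pid=,user=,etime=,comm= | stale_filter 90
```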

The second stab usually consists of creating dummy directories for people who are expiring and no longer have home directories. The funkydir script in the maildist directory can be used to do this. The list of missing directories can be found in the hostname.log file written by the mdist4 pass. The next pass of the redistribution process will then move them, and accounting can later delete their newly moved directory. I think this is easier than adding special cases to the movemail script.
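Recreating the missing directories can be sketched as below. The format of the hostname.log entries is not documented here, so this hypothetical `make_dummy_dirs` function simply takes one absolute directory path per line on standard input (extracting those paths from the log is left out), creates each missing one, and reports what it created. It is an illustration, not the funkydir script.

```shell
#!/bin/sh
# Hypothetical stand-in for funkydir: create empty home directories for
# users whose directories are missing, so the next pass can move them.
# Input: one absolute directory path per line on standard input.
make_dummy_dirs() {
    while read -r dir || [ -n "$dir" ]; do
        case "$dir" in
            /*) ;;                                   # absolute path: ok
            *) [ -n "$dir" ] && echo "skipping: $dir" >&2; continue ;;
        esac
        if [ -d "$dir" ]; then
            continue                                 # already exists
        fi
        mkdir -p "$dir" && echo "created $dir"
    done
}
```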

When all else fails, active users who are on continuously can be sent a notice asking them to log out, or they can be HUP'd and then forcibly moved with movemail -force.


Ken Lowe
Email -- ken@u.washington.edu
Web -- http://staff.washington.edu/krl/