tools/telemetry/third_party/gsutilz/gslib/addlhelp/prod.py

# -*- coding: utf-8 -*-
# Copyright 2012 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Additional help about using gsutil for production tasks."""

from __future__ import absolute_import

from gslib.help_provider import HelpProvider

_DETAILED_HELP_TEXT = ("""
<B>OVERVIEW</B>
  If you use gsutil in large production tasks (such as uploading or
  downloading many GiBs of data each night), there are a number of things
  you can do to help ensure success. Specifically, this section discusses
  how to script large production tasks around gsutil's resumable transfer
  mechanism.


<B>BACKGROUND ON RESUMABLE TRANSFERS</B>
  First, it's helpful to understand gsutil's resumable transfer mechanism,
  and how your script needs to be implemented around this mechanism to work
  reliably. gsutil uses resumable transfer support when you attempt to upload
  or download a file larger than a configurable threshold (by default, this
  threshold is 2 MiB). When a transfer fails partway through (e.g., because of
  an intermittent network problem), gsutil uses a truncated randomized binary
  exponential backoff-and-retry strategy that by default will retry transfers up
  to 6 times over a 63-second period (see "gsutil help retries" for
  details). If the transfer fails each of these attempts with no intervening
  progress, gsutil gives up on the transfer, but keeps a "tracker" file for
  it in a configurable location (the default location is ~/.gsutil/, in a file
  named by a combination of the SHA1 hash of the name of the bucket and object
  being transferred and the last 16 characters of the file name). When transfers
  fail in this fashion, you can rerun gsutil at some later time (e.g., after
  the networking problem has been resolved), and the resumable transfer picks
  up where it left off.


<B>SCRIPTING DATA TRANSFER TASKS</B>
  To script large production data transfer tasks around this mechanism,
  you can implement a script that runs periodically, determines which file
  transfers have not yet succeeded, and runs gsutil to copy them. Below,
  we offer a number of suggestions about how this type of scripting should
  be implemented:

  1. When resumable transfers fail without any progress 6 times in a row
     over the course of up to 63 seconds, it probably won't work to simply
     retry the transfer immediately. A more successful strategy would be to
     have a cron job that runs every 30 minutes, determines which transfers
     need to be run, and runs them. If the network experiences intermittent
     problems, the script picks up where it left off and will eventually
     succeed (once the network problem has been resolved).

  2. If your business depends on timely data transfer, you should consider
     implementing some network monitoring. For example, you can implement
     a task that attempts a small download every few minutes (or more or
     less frequently, depending on your requirements) and raises an alert
     if several attempts in a row fail, so that your IT staff can
     investigate problems promptly. As usual with monitoring implementations,
     you should experiment with the alerting thresholds, to avoid false
     positive alerts that cause your staff to begin ignoring the alerts.

  3. There are a variety of ways you can determine what files remain to be
     transferred. We recommend that you avoid attempting to get a complete
     listing of a bucket containing many objects (e.g., tens of thousands
     or more). One strategy is to structure your object names in a way that
     represents your transfer process, and use gsutil prefix wildcards to
     request partial bucket listings. For example, if your periodic process
     involves downloading the current day's objects, you could name objects
     using a year-month-day-object-ID format and then find today's objects by
     using a command like gsutil ls "gs://bucket/2011-09-27-*". Note that it
     is more efficient to have a non-wildcard prefix like this than to use
     something like gsutil ls "gs://bucket/*-2011-09-27". The latter command
     actually requests a complete bucket listing and then filters in gsutil,
     while the former asks Google Storage to return the subset of objects
     whose names start with everything up to the "*".

     For data uploads, another technique would be to move local files from a "to
     be processed" area to a "done" area as your script successfully copies
     files to the cloud. You can do this in parallel batches by using a command
     like:

       gsutil -m cp -r to_upload/subdir_$i gs://bucket/subdir_$i

     where i is a shell loop variable. Make sure to check that the shell exit
     status ($? in sh and bash, $status in csh) is 0 after each gsutil cp
     command, to detect if some of the copies failed, and rerun the affected
     copies.

     With this strategy, the file system keeps track of all remaining work to
     be done.

  4. If you have really large numbers of objects in a single bucket
     (say hundreds of thousands or more), you should consider tracking your
     objects in a database instead of using bucket listings to enumerate
     the objects. For example, this database could track the state of your
     downloads, so you can determine what objects need to be downloaded by
     your periodic download script by querying the database locally instead
     of performing a bucket listing.

  5. Make sure you don't delete partially downloaded files after a transfer
     fails: gsutil picks up where it left off (and performs an MD5 check of
     the final downloaded content to ensure data integrity), so deleting
     partially transferred files will cause you to lose progress and make
     more wasteful use of your network. You should also make sure whatever
     process is waiting to consume the downloaded data doesn't get pointed
     at the partially downloaded files. One way to do this is to download
     into a staging directory and then move successfully downloaded files to
     a directory where consumer processes will read them.

  6. If you have a fast network connection, you can speed up the transfer of
     large numbers of files by using the gsutil -m (multi-threading /
     multi-processing) option. Be aware, however, that gsutil doesn't attempt to
     keep track of which files were downloaded successfully in cases where some
     files failed to download. For example, if you use multi-threaded transfers
     to download 100 files and 3 failed to download, it is up to your scripting
     process to determine which transfers didn't succeed, and retry them. A
     periodic check-and-run approach like the one outlined earlier would handle
     this case.

     If you use parallel transfers (gsutil -m) you might want to experiment with
     the number of threads being used (via the parallel_thread_count setting
     in the .boto config file). By default, gsutil uses 10 threads for Linux
     and 24 threads for other operating systems. Depending on your network
     speed, available memory, CPU load, and other conditions, this may or may
     not be optimal. Try experimenting with higher or lower numbers of threads
     to find the best number of threads for your environment.


<B>RUNNING GSUTIL ON MULTIPLE MACHINES</B>
  When running gsutil on multiple machines that are all attempting to use the
  same OAuth2 refresh token, it is possible to encounter rate-limiting errors
  for the refresh requests (especially if all of these machines are likely to
  start running gsutil at the same time). To account for this, gsutil will
  automatically retry OAuth2 refresh requests with a truncated randomized
  exponential backoff strategy like the one described in the
  "BACKGROUND ON RESUMABLE TRANSFERS" section above. The number of retries
  attempted for OAuth2 refresh requests can be controlled via the
  "oauth2_refresh_retries" variable in the .boto config file.
""")


class CommandOptions(HelpProvider):
  """Additional help about using gsutil for production tasks."""

  # Help specification. See help_provider.py for documentation.
  help_spec = HelpProvider.HelpSpec(
      help_name='prod',
      help_name_aliases=[
          'production', 'resumable', 'resumable upload', 'resumable transfer',
          'resumable download', 'scripts', 'scripting'],
      help_type='additional_help',
      help_one_line_summary='Scripting Production Transfers',
      help_text=_DETAILED_HELP_TEXT,
      subcommand_help_text={},
  )
 163       subcommand_help_text={},
 164   )