EMR
boto.emr
This module provies an interface to the Elastic MapReduce (EMR)
service from AWS.
-
boto.emr.connect_to_region(region_name, **kw_params)
-
boto.emr.regions()
Get all available regions for the Amazon Elastic MapReduce service.
Return type: | list |
Returns: | A list of boto.regioninfo.RegionInfo |
boto.emr.connection
Represents a connection to the EMR service
-
class boto.emr.connection.EmrConnection(aws_access_key_id=None, aws_secret_access_key=None, is_secure=True, port=None, proxy=None, proxy_port=None, proxy_user=None, proxy_pass=None, debug=0, https_connection_factory=None, region=None, path='/', security_token=None, validate_certs=True)
-
APIVersion = '2009-03-31'
-
DebuggingArgs = 's3n://us-east-1.elasticmapreduce/libs/state-pusher/0.1/fetch'
-
DebuggingJar = 's3n://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar'
-
DefaultRegionEndpoint = 'elasticmapreduce.us-east-1.amazonaws.com'
-
DefaultRegionName = 'us-east-1'
-
ResponseError
alias of EmrResponseError
-
add_instance_groups(jobflow_id, instance_groups)
Adds instance groups to a running cluster.
Parameters: |
- jobflow_id (str) – The id of the jobflow which will take the
new instance groups
- instance_groups (list(boto.emr.InstanceGroup)) – A list of instance groups to add to the job
|
-
add_jobflow_steps(jobflow_id, steps)
Adds steps to a jobflow
Parameters: |
- jobflow_id (str) – The job flow id
- steps (list(boto.emr.Step)) – A list of steps to add to the job
|
-
describe_jobflow(jobflow_id)
Describes a single Elastic MapReduce job flow
Parameters: | jobflow_id (str) – The job flow id of interest |
-
describe_jobflows(states=None, jobflow_ids=None, created_after=None, created_before=None)
Retrieve all the Elastic MapReduce job flows on your account
Parameters: |
- states (list) – A list of strings with job flow states wanted
- jobflow_ids (list) – A list of job flow IDs
- created_after (datetime) – Bound on job flow creation time
- created_before (datetime) – Bound on job flow creation time
|
-
modify_instance_groups(instance_group_ids, new_sizes)
Modify the number of nodes and configuration settings in an
instance group.
Parameters: |
- instance_group_ids (list(str)) – A list of the ID’s of the instance
groups to be modified
- new_sizes (list(int)) – A list of the new sizes for each instance group
|
-
run_jobflow(name, log_uri=None, ec2_keyname=None, availability_zone=None, master_instance_type='m1.small', slave_instance_type='m1.small', num_instances=1, action_on_failure='TERMINATE_JOB_FLOW', keep_alive=False, enable_debugging=False, hadoop_version=None, steps=, []bootstrap_actions=, []instance_groups=None, additional_info=None, ami_version=None, api_params=None)
Runs a job flow
:type name: str
:param name: Name of the job flow
Parameters: |
- log_uri (str) – URI of the S3 bucket to place logs
- ec2_keyname (str) – EC2 key used for the instances
- availability_zone (str) – EC2 availability zone of the cluster
- master_instance_type (str) – EC2 instance type of the master
- slave_instance_type (str) – EC2 instance type of the slave nodes
- num_instances (int) – Number of instances in the Hadoop cluster
- action_on_failure (str) – Action to take if a step terminates
- keep_alive (bool) – Denotes whether the cluster should stay
alive upon completion
- enable_debugging (bool) – Denotes whether AWS console debugging
should be enabled.
- hadoop_version (str) – Version of Hadoop to use. This no longer
|
defaults to ‘0.20’ and now uses the AMI default.
Parameters: |
- steps (list(boto.emr.Step)) – List of steps to add with the job
- bootstrap_actions (list(boto.emr.BootstrapAction)) – List of bootstrap actions that run
before Hadoop starts.
- instance_groups (list(boto.emr.InstanceGroup)) – Optional list of instance groups to
use when creating this job.
NB: When provided, this argument supersedes num_instances
and master/slave_instance_type.
- ami_version (str) – Amazon Machine Image (AMI) version to use
for instances. Values accepted by EMR are ‘1.0’, ‘2.0’, and
‘latest’; EMR currently defaults to ‘1.0’ if you don’t set
‘ami_version’.
- additional_info (JSON str) – A JSON string for selecting additional features
- api_params (dict) – a dictionary of additional parameters to pass
directly to the EMR API (so you don’t have to upgrade boto to
use new EMR features). You can also delete an API parameter
by setting it to None.
|
Return type: | str
|
Returns: | The jobflow id
|
-
set_termination_protection(jobflow_id, termination_protection_status)
Set termination protection on specified Elastic MapReduce job flows
Parameters: |
- jobflow_ids (list or str) – A list of job flow IDs
- termination_protection_status (bool) – Termination protection status
|
-
terminate_jobflow(jobflow_id)
Terminate an Elastic MapReduce job flow
Parameters: | jobflow_id (str) – A jobflow id |
-
terminate_jobflows(jobflow_ids)
Terminate an Elastic MapReduce job flow
Parameters: | jobflow_ids (list) – A list of job flow IDs |
boto.emr.step
-
class boto.emr.step.HiveBase(name, **kw)
-
BaseArgs = ['s3n://us-east-1.elasticmapreduce/libs/hive/hive-script', '--base-path', 's3n://us-east-1.elasticmapreduce/libs/hive/']
-
class boto.emr.step.HiveStep(name, hive_file, hive_versions='latest', hive_args=None)
Hive script step
-
class boto.emr.step.InstallHiveStep(hive_versions='latest', hive_site=None)
Install Hive on EMR step
-
InstallHiveName = 'Install Hive'
-
class boto.emr.step.InstallPigStep(pig_versions='latest')
Install pig on emr step
-
InstallPigName = 'Install Pig'
-
class boto.emr.step.JarStep(name, jar, main_class=None, action_on_failure='TERMINATE_JOB_FLOW', step_args=None)
Custom jar step
A elastic mapreduce step that executes a jar
Parameters: |
- name (str) – The name of the step
- jar (str) – S3 URI to the Jar file
- main_class (str) – The class to execute in the jar
- action_on_failure (str) – An action, defined in the EMR docs to take on failure.
- step_args (list(str)) – A list of arguments to pass to the step
|
-
args()
-
jar()
-
main_class()
-
class boto.emr.step.PigBase(name, **kw)
-
BaseArgs = ['s3n://us-east-1.elasticmapreduce/libs/pig/pig-script', '--base-path', 's3n://us-east-1.elasticmapreduce/libs/pig/']
-
class boto.emr.step.PigStep(name, pig_file, pig_versions='latest', pig_args=[])
Pig script step
-
class boto.emr.step.ScriptRunnerStep(name, **kw)
-
ScriptRunnerJar = 's3n://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar'
-
class boto.emr.step.Step
Jobflow Step base class
-
args()
Return type: | list(str) |
Returns: | List of arguments for the step |
-
jar()
Return type: | str |
Returns: | URI to the jar |
-
main_class()
Return type: | str |
Returns: | The main class name |
-
class boto.emr.step.StreamingStep(name, mapper, reducer=None, combiner=None, action_on_failure='TERMINATE_JOB_FLOW', cache_files=None, cache_archives=None, step_args=None, input=None, output=None, jar='/home/hadoop/contrib/streaming/hadoop-streaming.jar')
Hadoop streaming step
A hadoop streaming elastic mapreduce step
Parameters: |
- name (str) – The name of the step
- mapper (str) – The mapper URI
- reducer (str) – The reducer URI
- combiner (str) – The combiner URI. Only works for Hadoop 0.20 and later!
- action_on_failure (str) – An action, defined in the EMR docs to take on failure.
- cache_files (list(str)) – A list of cache files to be bundled with the job
- cache_archives (list(str)) – A list of jar archives to be bundled with the job
- step_args (list(str)) – A list of arguments to pass to the step
- input (str or a list of str) – The input uri
- output (str) – The output uri
- jar (str) – The hadoop streaming jar. This can be either a local path on the master node, or an s3:// URI.
|
-
args()
-
jar()
-
main_class()
boto.emr.emrobject
This module contains EMR response objects
-
class boto.emr.emrobject.AddInstanceGroupsResponse(connection=None)
-
Fields = set(['InstanceGroupIds', 'JobFlowId'])
-
class boto.emr.emrobject.Arg(connection=None)
-
endElement(name, value, connection)
-
class boto.emr.emrobject.BootstrapAction(connection=None)
-
Fields = set(['Path', 'Args', 'Name'])
-
startElement(name, attrs, connection)
-
class boto.emr.emrobject.EmrObject(connection=None)
-
Fields = set([])
-
endElement(name, value, connection)
-
startElement(name, attrs, connection)
-
class boto.emr.emrobject.InstanceGroup(connection=None)
-
Fields = set(['ReadyDateTime', 'InstanceType', 'InstanceRole', 'EndDateTime', 'InstanceRunningCount', 'State', 'BidPrice', 'Market', 'StartDateTime', 'Name', 'InstanceGroupId', 'CreationDateTime', 'InstanceRequestCount', 'LastStateChangeReason', 'LaunchGroup'])
-
class boto.emr.emrobject.JobFlow(connection=None)
-
Fields = set(['TerminationProtected', 'MasterInstanceId', 'State', 'HadoopVersion', 'LogUri', 'AmiVersion', 'Ec2KeyName', 'ReadyDateTime', 'Type', 'JobFlowId', 'CreationDateTime', 'LastStateChangeReason', 'Name', 'EndDateTime', 'Value', 'InstanceCount', 'RequestId', 'StartDateTime', 'SlaveInstanceType', 'AvailabilityZone', 'MasterPublicDnsName', 'NormalizedInstanceHours', 'MasterInstanceType', 'KeepJobFlowAliveWhenNoSteps', 'Id'])
-
startElement(name, attrs, connection)
-
class boto.emr.emrobject.KeyValue(connection=None)
-
Fields = set(['Value', 'Key'])
-
class boto.emr.emrobject.ModifyInstanceGroupsResponse(connection=None)
-
Fields = set(['RequestId'])
-
class boto.emr.emrobject.RunJobFlowResponse(connection=None)
-
Fields = set(['JobFlowId'])
-
class boto.emr.emrobject.Step(connection=None)
-
Fields = set(['Name', 'EndDateTime', 'Jar', 'ActionOnFailure', 'State', 'MainClass', 'StartDateTime', 'CreationDateTime', 'LastStateChangeReason'])
-
startElement(name, attrs, connection)