EMR

boto.emr

This module provides an interface to the Elastic MapReduce (EMR) service from AWS.

boto.emr.connection

Represents a connection to the EMR service

class boto.emr.connection.EmrConnection(aws_access_key_id=None, aws_secret_access_key=None, is_secure=True, port=None, proxy=None, proxy_port=None, proxy_user=None, proxy_pass=None, debug=0, https_connection_factory=None, region=None, path='/')
APIVersion = '2009-03-31'
DebuggingArgs = 's3n://us-east-1.elasticmapreduce/libs/state-pusher/0.1/fetch'
DebuggingJar = 's3n://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar'
DefaultRegionEndpoint = 'elasticmapreduce.amazonaws.com'
DefaultRegionName = 'us-east-1'
ResponseError

alias of EmrResponseError

add_instance_groups(jobflow_id, instance_groups)

Adds instance groups to a running cluster.

Parameters:
  • jobflow_id (str) – The id of the jobflow which will take the new instance groups
  • instance_groups (list(boto.emr.InstanceGroup)) – A list of instance groups to add to the job
add_jobflow_steps(jobflow_id, steps)

Adds steps to a jobflow

Parameters:
  • jobflow_id (str) – The job flow id
  • steps (list(boto.emr.Step)) – A list of steps to add to the job
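
A minimal sketch of appending steps to a running job flow. It assumes `conn` is an already-constructed EmrConnection and that each step exposes the `args()`, `jar()`, and `main_class()` methods listed for `boto.emr.step.Step`. The helper name `append_steps` is hypothetical, and the interface check is a convenience added here, not something the library is known to require.

```python
def append_steps(conn, jobflow_id, steps):
    """Check each step's interface, then pass the steps to add_jobflow_steps."""
    steps = list(steps)
    for step in steps:
        # These three method names come from the Step base class reference.
        for method in ("args", "jar", "main_class"):
            if not callable(getattr(step, method, None)):
                raise TypeError("step %r lacks a %s() method" % (step, method))
    return conn.add_jobflow_steps(jobflow_id, steps)
```

Failing early on a malformed step avoids sending a partially valid request to the service.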
describe_jobflow(jobflow_id)

Describes a single Elastic MapReduce job flow

Parameters:jobflow_id (str) – The job flow id of interest
describe_jobflows(states=None, jobflow_ids=None, created_after=None, created_before=None)

Retrieve the Elastic MapReduce job flows on your account, optionally filtered by state, ID, or creation time

Parameters:
  • states (list) – A list of strings with job flow states wanted
  • jobflow_ids (list) – A list of job flow IDs
  • created_after (datetime) – Return only job flows created after this time
  • created_before (datetime) – Return only job flows created before this time
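
The filters above can be combined. The following sketch lists job flows created within the last few days that are still active. It assumes Python 3, an existing EmrConnection passed in as `conn`, and that the state strings shown are the job flow state names used by EMR. The helper name `recent_active_jobflows` and its `now` parameter are hypothetical.

```python
import datetime


def recent_active_jobflows(conn, days=7, now=None):
    """Return job flows created in the last `days` days that are still active."""
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    created_after = now - datetime.timedelta(days=days)
    return conn.describe_jobflows(
        # Assumed state names; check the EMR documentation for the full set.
        states=["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"],
        created_after=created_after,
    )
```

Accepting `now` as an argument keeps the time window reproducible when the helper is tested.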
modify_instance_groups(instance_group_ids, new_sizes)

Modify the number of nodes and configuration settings in an instance group.

Parameters:
  • instance_group_ids (list(str)) – A list of the IDs of the instance groups to be modified
  • new_sizes (list(int)) – A list of the new sizes, one per instance group, in the same order as instance_group_ids
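
Because the two arguments are parallel lists, it is easy to let them drift out of step. A small sketch that derives both lists from one mapping is shown below; `resize_instance_groups` is a hypothetical helper and `conn` is assumed to be an existing EmrConnection.

```python
def resize_instance_groups(conn, new_size_by_group_id):
    """Resize instance groups from a {instance_group_id: new_size} mapping."""
    # Sort the IDs so both lists are built in the same, deterministic order.
    group_ids = sorted(new_size_by_group_id)
    new_sizes = [int(new_size_by_group_id[gid]) for gid in group_ids]
    return conn.modify_instance_groups(group_ids, new_sizes)
```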
run_jobflow(name, log_uri, ec2_keyname=None, availability_zone=None, master_instance_type='m1.small', slave_instance_type='m1.small', num_instances=1, action_on_failure='TERMINATE_JOB_FLOW', keep_alive=False, enable_debugging=False, hadoop_version=None, steps=[], bootstrap_actions=[], instance_groups=None, additional_info=None, ami_version='1.0', api_params=None)

Runs a job flow

Parameters:
  • name (str) – Name of the job flow
  • log_uri (str) – URI of the S3 bucket to place logs
  • ec2_keyname (str) – EC2 key used for the instances
  • availability_zone (str) – EC2 availability zone of the cluster
  • master_instance_type (str) – EC2 instance type of the master
  • slave_instance_type (str) – EC2 instance type of the slave nodes
  • num_instances (int) – Number of instances in the Hadoop cluster
  • action_on_failure (str) – Action to take if a step fails
  • keep_alive (bool) – Denotes whether the cluster should stay alive upon completion
  • enable_debugging (bool) – Denotes whether AWS console debugging should be enabled.
  • hadoop_version (str) – Version of Hadoop to use. If ami_version is not set, defaults to ‘0.20’ for backwards compatibility with older versions of boto.
  • steps (list(boto.emr.Step)) – List of steps to add with the job
  • bootstrap_actions (list(boto.emr.BootstrapAction)) – List of bootstrap actions that run before Hadoop starts.
  • instance_groups (list(boto.emr.InstanceGroup)) – Optional list of instance groups to use when creating this job. NB: When provided, this argument supersedes num_instances and master/slave_instance_type.
  • ami_version (str) – Amazon Machine Image (AMI) version to use for instances. Values accepted by EMR are ‘1.0’, ‘2.0’, and ‘latest’; EMR currently defaults to ‘1.0’ if you don’t set ‘ami_version’.
  • additional_info (JSON str) – A JSON string for selecting additional features
  • api_params (dict) – A dictionary of additional parameters to pass directly to the EMR API (so you don’t have to upgrade boto to use new EMR features). You can also delete an API parameter by setting it to None.
Return type:str
Returns:The jobflow id
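
A hedged sketch of a typical launch call, using only keyword arguments that appear in the signature above. The wrapper name `launch_jobflow` and its defaults are hypothetical; `conn` is assumed to be an existing EmrConnection and `steps` a list of step objects such as JarStep or StreamingStep instances.

```python
def launch_jobflow(conn, name, log_uri, steps, num_instances=3, keep_alive=False):
    """Start a job flow and return the job flow id reported by run_jobflow."""
    if num_instances < 1:
        raise ValueError("num_instances must be at least 1")
    return conn.run_jobflow(
        name=name,
        log_uri=log_uri,
        steps=list(steps),
        num_instances=num_instances,
        master_instance_type="m1.small",
        slave_instance_type="m1.small",
        keep_alive=keep_alive,
        action_on_failure="TERMINATE_JOB_FLOW",
    )
```

The returned string is the value that the other methods in this class accept as `jobflow_id`.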

set_termination_protection(jobflow_id, termination_protection_status)

Set termination protection on the specified Elastic MapReduce job flow

Parameters:
  • jobflow_id (str) – The job flow ID
  • termination_protection_status (bool) – Termination protection status
terminate_jobflow(jobflow_id)

Terminate an Elastic MapReduce job flow

Parameters:jobflow_id (str) – A jobflow id
terminate_jobflows(jobflow_ids)

Terminate one or more Elastic MapReduce job flows

Parameters:jobflow_ids (list) – A list of job flow IDs
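
Termination protection and termination are often used together. The sketch below turns protection off for each job flow and then terminates them in one call. `force_terminate` is a hypothetical helper, `conn` is assumed to be an existing EmrConnection, and the sketch assumes that disabling protection first is the desired behaviour.

```python
def force_terminate(conn, jobflow_ids):
    """Disable termination protection, then terminate the given job flows."""
    if isinstance(jobflow_ids, str):
        # Accept a single ID as a convenience.
        jobflow_ids = [jobflow_ids]
    jobflow_ids = list(jobflow_ids)
    for jobflow_id in jobflow_ids:
        conn.set_termination_protection(jobflow_id, False)
    conn.terminate_jobflows(jobflow_ids)
    return jobflow_ids
```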

boto.emr.step

class boto.emr.step.JarStep(name, jar, main_class=None, action_on_failure='TERMINATE_JOB_FLOW', step_args=None)

Custom jar step

An Elastic MapReduce step that executes a jar

Parameters:
  • name (str) – The name of the step
  • jar (str) – S3 URI to the Jar file
  • main_class (str) – The class to execute in the jar
  • action_on_failure (str) – An action, defined in the EMR docs, to take on failure.
  • step_args (list(str)) – A list of arguments to pass to the step
args()
jar()
main_class()
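
As an illustration of the constructor arguments, the sketch below assembles keyword arguments for a JarStep without importing boto, so it can be run anywhere. `jar_step_kwargs` is a hypothetical helper and the S3 URI and class name in the usage comment are placeholders.

```python
def jar_step_kwargs(name, jar_uri, main_class=None, step_args=None,
                    action_on_failure="TERMINATE_JOB_FLOW"):
    """Build keyword arguments matching the JarStep constructor signature."""
    return {
        "name": name,
        "jar": jar_uri,
        "main_class": main_class,
        "action_on_failure": action_on_failure,
        # Copy the arguments so later changes by the caller do not leak in.
        "step_args": list(step_args) if step_args else None,
    }

# With boto installed, the result could be used as:
#   from boto.emr.step import JarStep
#   step = JarStep(**jar_step_kwargs("wordcount", "s3://my-bucket/wc.jar",
#                                    main_class="com.example.WordCount"))
```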
class boto.emr.step.Step

Jobflow Step base class

args()
Return type:list(str)
Returns:List of arguments for the step
jar()
Return type:str
Returns:URI to the jar
main_class()
Return type:str
Returns:The main class name
class boto.emr.step.StreamingStep(name, mapper, reducer=None, combiner=None, action_on_failure='TERMINATE_JOB_FLOW', cache_files=None, cache_archives=None, step_args=None, input=None, output=None, jar='/home/hadoop/contrib/streaming/hadoop-streaming.jar')

Hadoop streaming step

A Hadoop streaming Elastic MapReduce step

Parameters:
  • name (str) – The name of the step
  • mapper (str) – The mapper URI
  • reducer (str) – The reducer URI
  • combiner (str) – The combiner URI. Only works for Hadoop 0.20 and later!
  • action_on_failure (str) – An action, defined in the EMR docs, to take on failure.
  • cache_files (list(str)) – A list of cache files to be bundled with the job
  • cache_archives (list(str)) – A list of jar archives to be bundled with the job
  • step_args (list(str)) – A list of arguments to pass to the step
  • input (str or a list of str) – The input uri
  • output (str) – The output uri
  • jar (str) – The hadoop streaming jar. This can be either a local path on the master node, or an s3:// URI.
args()
jar()
main_class()
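
Since `input` accepts either a single URI or a list of URIs, a caller may want to normalize it before constructing the step. The sketch below does that and returns keyword arguments for StreamingStep. `streaming_step_kwargs` is a hypothetical helper and all URIs in the usage comment are placeholders.

```python
def streaming_step_kwargs(name, mapper, reducer, inputs, output):
    """Build keyword arguments matching the StreamingStep constructor signature."""
    if isinstance(inputs, str):
        # A single URI is wrapped so the step always receives a list.
        inputs = [inputs]
    return {
        "name": name,
        "mapper": mapper,
        "reducer": reducer,
        "input": list(inputs),
        "output": output,
    }

# With boto installed, the result could be used as:
#   from boto.emr.step import StreamingStep
#   step = StreamingStep(**streaming_step_kwargs(
#       "wordcount", "s3://my-bucket/mapper.py", "s3://my-bucket/reducer.py",
#       "s3://my-bucket/input/", "s3://my-bucket/output/"))
```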

boto.emr.emrobject

This module contains EMR response objects

class boto.emr.emrobject.AddInstanceGroupsResponse(connection=None)
Fields = set(['InstanceGroupIds', 'JobFlowId'])
class boto.emr.emrobject.Arg(connection=None)
endElement(name, value, connection)
class boto.emr.emrobject.BootstrapAction(connection=None)
Fields = set(['Path', 'Args', 'Name'])
startElement(name, attrs, connection)
class boto.emr.emrobject.EmrObject(connection=None)
Fields = set([])
endElement(name, value, connection)
startElement(name, attrs, connection)
class boto.emr.emrobject.InstanceGroup(connection=None)
Fields = set(['ReadyDateTime', 'InstanceType', 'InstanceRole', 'EndDateTime', 'InstanceRunningCount', 'State', 'BidPrice', 'Market', 'StartDateTime', 'Name', 'InstanceGroupId', 'CreationDateTime', 'InstanceRequestCount', 'LastStateChangeReason', 'LaunchGroup'])
class boto.emr.emrobject.JobFlow(connection=None)
Fields = set(['TerminationProtected', 'MasterInstanceId', 'State', 'HadoopVersion', 'LogUri', 'AmiVersion', 'Ec2KeyName', 'ReadyDateTime', 'Type', 'JobFlowId', 'CreationDateTime', 'LastStateChangeReason', 'Name', 'EndDateTime', 'Value', 'InstanceCount', 'RequestId', 'StartDateTime', 'SlaveInstanceType', 'AvailabilityZone', 'MasterPublicDnsName', 'NormalizedInstanceHours', 'MasterInstanceType', 'KeepJobFlowAliveWhenNoSteps', 'Id'])
startElement(name, attrs, connection)
class boto.emr.emrobject.KeyValue(connection=None)
Fields = set(['Value', 'Key'])
class boto.emr.emrobject.ModifyInstanceGroupsResponse(connection=None)
Fields = set(['RequestId'])
class boto.emr.emrobject.RunJobFlowResponse(connection=None)
Fields = set(['JobFlowId'])
class boto.emr.emrobject.Step(connection=None)
Fields = set(['Name', 'EndDateTime', 'Jar', 'ActionOnFailure', 'State', 'MainClass', 'StartDateTime', 'CreationDateTime', 'LastStateChangeReason'])
startElement(name, attrs, connection)
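
The `Fields` sets and the `startElement`/`endElement` methods suggest a SAX-style parser in which each response object keeps only the XML elements it cares about. The sketch below illustrates that general pattern using the standard library. `SketchResponse`, `SketchHandler`, and `parse_response` are hypothetical names, the lowercase attribute naming is an assumption, and none of this is the library's actual implementation.

```python
import xml.sax


class SketchResponse(object):
    """Stand-in response object that stores the elements named in Fields."""
    Fields = set(['JobFlowId'])

    def startElement(self, name, attrs, connection):
        return None

    def endElement(self, name, value, connection):
        if name in self.Fields:
            # Assumed convention: element name lowercased as attribute name.
            setattr(self, name.lower(), value)


class SketchHandler(xml.sax.ContentHandler):
    """Forward SAX events to a response object, collecting element text."""

    def __init__(self, root):
        xml.sax.ContentHandler.__init__(self)
        self.root = root
        self.text = []

    def startElement(self, name, attrs):
        self.text = []
        self.root.startElement(name, attrs, None)

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        self.root.endElement(name, ''.join(self.text), None)
        self.text = []


def parse_response(xml_bytes, root):
    """Parse an XML document into the given response object."""
    xml.sax.parseString(xml_bytes, SketchHandler(root))
    return root
```

Elements outside `Fields` are ignored, which is why each class in this module lists the names it expects.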