EMR

boto.emr

This module provides an interface to the Elastic MapReduce (EMR) service from AWS.

boto.emr.connection

Represents a connection to the EMR service

class boto.emr.connection.EmrConnection(aws_access_key_id=None, aws_secret_access_key=None, is_secure=True, port=None, proxy=None, proxy_port=None, proxy_user=None, proxy_pass=None, debug=0, https_connection_factory=None, region=None, path='/')
APIVersion = '2009-03-31'
DebuggingArgs = 's3n://us-east-1.elasticmapreduce/libs/state-pusher/0.1/fetch'
DebuggingJar = 's3n://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar'
DefaultRegionEndpoint = 'elasticmapreduce.amazonaws.com'
DefaultRegionName = 'us-east-1'
ResponseError

alias of EmrResponseError

add_jobflow_steps(jobflow_id, steps)

Adds steps to a jobflow

Parameters:
  • jobflow_id (str) – The job flow id
  • steps (list(boto.emr.Step)) – A list of steps to add to the job
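A minimal sketch of appending work to a job flow that is already running. The helper name `append_steps` is hypothetical. `conn` is assumed to be an `EmrConnection` (only its `add_jobflow_steps` method is used), and the job flow is assumed to still be accepting steps, for example because it was started with `keep_alive=True`.

```python
def append_steps(conn, jobflow_id, steps):
    """Add one step, or a list of steps, to an existing job flow.

    `conn` is assumed to be a boto.emr.connection.EmrConnection; only its
    add_jobflow_steps(jobflow_id, steps) method is called.
    """
    # add_jobflow_steps is documented as taking a list, so wrap a lone step.
    if not isinstance(steps, (list, tuple)):
        steps = [steps]
    return conn.add_jobflow_steps(jobflow_id, list(steps))
```

Accepting either a single step or a list is a convenience of this sketch, not something the library itself is documented to do.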
describe_jobflow(jobflow_id)

Describes a single Elastic MapReduce job flow

Parameters: jobflow_id (str) – The job flow id of interest
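One common use of `describe_jobflow` is polling until a job flow finishes. The sketch below is an illustration only: the helper name, the set of terminal state names, and the lower-cased `.state` attribute on the returned object are assumptions, not confirmed by this reference.

```python
import time

# Assumption: these are the job flow states after which nothing changes.
TERMINAL_STATES = frozenset(["COMPLETED", "FAILED", "TERMINATED"])


def wait_for_jobflow(conn, jobflow_id, poll_seconds=30, max_polls=120,
                     sleep=time.sleep):
    """Poll describe_jobflow until the job flow reaches a terminal state.

    Returns the final state string, or None if max_polls is exhausted.
    """
    for _ in range(max_polls):
        jobflow = conn.describe_jobflow(jobflow_id)
        # Assumption: the 'State' field is exposed as the attribute .state
        if jobflow.state in TERMINAL_STATES:
            return jobflow.state
        sleep(poll_seconds)
    return None
```

The `sleep` parameter exists so the loop can be exercised without real delays; in normal use the default `time.sleep` is enough.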
describe_jobflows(states=None, jobflow_ids=None, created_after=None, created_before=None)

Retrieve all the Elastic MapReduce job flows on your account

Parameters:
  • states (list) – A list of strings with job flow states wanted
  • jobflow_ids (list) – A list of job flow IDs
  • created_after (datetime) – Bound on job flow creation time
  • created_before (datetime) – Bound on job flow creation time
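The filters can be combined. A minimal sketch, assuming the state names below match the EMR service documentation and that creation-time bounds are interpreted as UTC; the helper name `active_jobflows_since` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def active_jobflows_since(conn, hours=24, now=None):
    """Return job flows created in the last `hours` that are still active."""
    now = now or datetime.now(timezone.utc)
    return conn.describe_jobflows(
        # Assumption: state names as used by the EMR service.
        states=["STARTING", "RUNNING", "WAITING"],
        created_after=now - timedelta(hours=hours),
    )
```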
run_jobflow(name, log_uri, ec2_keyname=None, availability_zone=None, master_instance_type='m1.small', slave_instance_type='m1.small', num_instances=1, action_on_failure='TERMINATE_JOB_FLOW', keep_alive=False, enable_debugging=False, hadoop_version='0.18', steps=[], bootstrap_actions=[])

Runs a job flow

Parameters:
  • name (str) – Name of the job flow
  • log_uri (str) – URI of the S3 bucket to place logs
  • ec2_keyname (str) – EC2 key used for the instances
  • availability_zone (str) – EC2 availability zone of the cluster
  • master_instance_type (str) – EC2 instance type of the master
  • slave_instance_type (str) – EC2 instance type of the slave nodes
  • num_instances (int) – Number of instances in the Hadoop cluster
  • action_on_failure (str) – Action to take if a step fails (default 'TERMINATE_JOB_FLOW')
  • keep_alive (bool) – Denotes whether the cluster should stay alive upon completion
  • enable_debugging (bool) – Denotes whether AWS console debugging should be enabled.
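  • hadoop_version (str) – Version of Hadoop to use
  • steps (list(boto.emr.Step)) – List of steps to add with the job
  • bootstrap_actions (list) – List of bootstrap actions for the job flow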
Return type: str
Returns: The jobflow id
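A minimal sketch of starting a job flow with the parameters above. The helper name `launch_jobflow`, the `logs/` prefix, and the bucket name are hypothetical; `conn` is assumed to be an `EmrConnection` and `steps` a list of step objects such as `JarStep` or `StreamingStep`.

```python
def launch_jobflow(conn, name, log_bucket, steps, workers=2, interactive=False):
    """Start a job flow and return its id."""
    return conn.run_jobflow(
        name=name,
        log_uri="s3://%s/logs/" % log_bucket,
        master_instance_type="m1.small",
        slave_instance_type="m1.small",
        # Assumption: num_instances counts the master as well as the slaves.
        num_instances=1 + workers,
        # Keep the cluster up after the last step only when asked to.
        keep_alive=interactive,
        steps=list(steps),
    )
```

With boto installed, `conn` would typically be built as `EmrConnection(aws_access_key_id=..., aws_secret_access_key=...)` using the constructor documented above.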

terminate_jobflow(jobflow_id)

Terminate an Elastic MapReduce job flow

Parameters: jobflow_id (str) – A jobflow id
terminate_jobflows(jobflow_ids)

Terminate one or more Elastic MapReduce job flows

Parameters: jobflow_ids (list) – A list of job flow IDs
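A sketch of a bulk cleanup built from `describe_jobflows` and `terminate_jobflows`. The helper name is hypothetical, and two things are assumed rather than confirmed here: that `WAITING` is the state of an idle keep-alive job flow, and that each returned job flow exposes its `JobFlowId` field as the attribute `jobflowid`.

```python
def terminate_idle_jobflows(conn, dry_run=True):
    """Terminate every job flow that is sitting in the WAITING state.

    Returns the list of affected job flow ids. With dry_run=True (the
    default) nothing is terminated; the ids are only reported.
    """
    waiting = conn.describe_jobflows(states=["WAITING"])
    # Assumption: the 'JobFlowId' field is exposed as .jobflowid
    ids = [jobflow.jobflowid for jobflow in waiting]
    if ids and not dry_run:
        conn.terminate_jobflows(ids)
    return ids
```

Defaulting to a dry run is a deliberate choice in this sketch, since termination cannot be undone.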

boto.emr.step

class boto.emr.step.JarStep(name, jar, main_class=None, action_on_failure='TERMINATE_JOB_FLOW', step_args=None)

Custom jar step

An Elastic MapReduce step that executes a jar

Parameters:
  • name (str) – The name of the step
  • jar (str) – S3 URI to the Jar file
  • main_class (str) – The class to execute in the jar
  • action_on_failure (str) – An action, defined in the EMR docs, to take on failure.
  • step_args (list(str)) – A list of arguments to pass to the step
args()
jar()
main_class()
class boto.emr.step.Step

Jobflow Step base class

args()
Return type: list(str)
Returns: List of arguments for the step
jar()
Return type: str
Returns: URI to the jar
main_class()
Return type: str
Returns: The main class name
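The three accessors above form the contract a step has to satisfy. As an illustration, here is a sketch of a custom step that runs a script through the script-runner jar listed as `DebuggingJar` on `EmrConnection`. The class name `ScriptStep` is hypothetical. It is assumed, not confirmed by this reference, that script-runner takes the script URI as its first argument and that the connection also reads `name` and `action_on_failure` from the step, as `JarStep`'s constructor suggests.

```python
try:
    from boto.emr.step import Step
except ImportError:
    # boto is not installed; fall back to a plain base class so the sketch
    # can still be read and exercised on its own.
    Step = object

# Same value as EmrConnection.DebuggingJar.
SCRIPT_RUNNER_JAR = (
    "s3n://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar"
)


class ScriptStep(Step):
    """Hypothetical step that runs a script via the script-runner jar."""

    def __init__(self, name, script_uri, script_args=None,
                 action_on_failure="TERMINATE_JOB_FLOW"):
        # Assumption: these two attributes are read alongside the accessors.
        self.name = name
        self.action_on_failure = action_on_failure
        self.script_uri = script_uri
        self.script_args = list(script_args or [])

    def jar(self):
        return SCRIPT_RUNNER_JAR

    def main_class(self):
        # No explicit main class; the jar's default entry point is used.
        return None

    def args(self):
        # Assumption: script-runner expects the script URI first.
        return [self.script_uri] + self.script_args
```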
class boto.emr.step.StreamingStep(name, mapper, reducer=None, action_on_failure='TERMINATE_JOB_FLOW', cache_files=None, cache_archives=None, step_args=None, input=None, output=None, jar='/home/hadoop/contrib/streaming/hadoop-streaming.jar')

Hadoop streaming step

A Hadoop streaming Elastic MapReduce step

Parameters:
  • name (str) – The name of the step
  • mapper (str) – The mapper URI
  • reducer (str) – The reducer URI
  • action_on_failure (str) – An action, defined in the EMR docs, to take on failure.
  • cache_files (list(str)) – A list of cache files to be bundled with the job
  • cache_archives (list(str)) – A list of jar archives to be bundled with the job
  • step_args (list(str)) – A list of arguments to pass to the step
  • input (str or a list of str) – The input URI
  • output (str) – The output URI
  • jar (str) – The hadoop streaming jar. This can be either a local path on the master node, or an s3:// URI.
args()
jar()
main_class()
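A sketch of the arguments for a word-count streaming step. The S3 sample paths and the `aggregate` reducer name are assumptions based on AWS's public word-count sample and may no longer be available; both helper names and the output prefix are hypothetical.

```python
def wordcount_step_kwargs(output_bucket):
    """Keyword arguments for a word-count StreamingStep."""
    return {
        "name": "Word count",
        # Assumption: AWS's public word-count sample is hosted at these paths.
        "mapper": "s3n://elasticmapreduce/samples/wordcount/wordSplitter.py",
        # Assumption: 'aggregate' selects Hadoop's built-in aggregate reducer.
        "reducer": "aggregate",
        "input": "s3n://elasticmapreduce/samples/wordcount/input",
        "output": "s3n://%s/wordcount/output" % output_bucket,
    }


def make_wordcount_step(output_bucket):
    """Build the StreamingStep itself; requires boto to be installed."""
    # Imported lazily so this module loads even where boto is unavailable.
    from boto.emr.step import StreamingStep
    return StreamingStep(**wordcount_step_kwargs(output_bucket))
```

Keeping the arguments in a plain dict makes them easy to inspect or log before any step object is created.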

boto.emr.emrobject

This module contains EMR response objects

class boto.emr.emrobject.AddInstanceGroupsResponse(connection=None)
Fields = set(['InstanceGroupIds', 'JobFlowId'])
class boto.emr.emrobject.Arg(connection=None)
endElement(name, value, connection)
class boto.emr.emrobject.BootstrapAction(connection=None)
Fields = set(['Path', 'Args', 'Name'])
startElement(name, attrs, connection)
class boto.emr.emrobject.EmrObject(connection=None)
Fields = set([])
endElement(name, value, connection)
startElement(name, attrs, connection)
class boto.emr.emrobject.InstanceGroup(connection=None)
Fields = set(['ReadyDateTime', 'InstanceType', 'InstanceRole', 'EndDateTime', 'InstanceRunningCount', 'State', 'BidPrice', 'Market', 'StartDateTime', 'Name', 'InstanceGroupId', 'CreationDateTime', 'InstanceRequestCount', 'LastStateChangeReason', 'LaunchGroup'])
class boto.emr.emrobject.JobFlow(connection=None)
Fields = set(['TerminationProtected', 'MasterInstanceId', 'State', 'HadoopVersion', 'LogUri', 'Ec2KeyName', 'ReadyDateTime', 'Type', 'JobFlowId', 'CreationDateTime', 'LastStateChangeReason', 'Name', 'EndDateTime', 'Value', 'InstanceCount', 'RequestId', 'StartDateTime', 'SlaveInstanceType', 'AvailabilityZone', 'MasterPublicDnsName', 'NormalizedInstanceHours', 'MasterInstanceType', 'KeepJobFlowAliveWhenNoSteps', 'Id'])
startElement(name, attrs, connection)
class boto.emr.emrobject.KeyValue(connection=None)
Fields = set(['Value', 'Key'])
class boto.emr.emrobject.ModifyInstanceGroupsResponse(connection=None)
Fields = set(['RequestId'])
class boto.emr.emrobject.RunJobFlowResponse(connection=None)
Fields = set(['JobFlowId'])
class boto.emr.emrobject.Step(connection=None)
Fields = set(['Name', 'EndDateTime', 'Jar', 'ActionOnFailure', 'State', 'MainClass', 'StartDateTime', 'CreationDateTime', 'LastStateChangeReason'])
startElement(name, attrs, connection)
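Each response class lists the XML elements it understands in `Fields`. The sketch below collects those values into a dict. It assumes, without confirmation from this reference, that a parsed field such as `JobFlowId` is stored on the instance under its lower-cased name (`jobflowid`); the helper name `fields_as_dict` is hypothetical.

```python
def fields_as_dict(emr_object):
    """Collect the populated response fields of an EMR object into a dict."""
    result = {}
    for field in sorted(emr_object.Fields):
        # Assumption: 'JobFlowId' is stored as the attribute 'jobflowid'.
        value = getattr(emr_object, field.lower(), None)
        if value is not None:
            result[field] = value
    return result
```

Fields that were absent from the response are skipped rather than reported as `None`.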