1) How the number of mappers in MapReduce is determined: Before the map stage reads data, FileInputFormat splits the input files into splits, and the number of splits determines the number of map tasks. The factors affecting the number of maps, i.e. the number of splits, are: 1) The HDFS block size, i.e. the value of dfs.block.size in HDFS. For a 1024 MB input file, a 256 MB block size gives 4 splits, while a 128 MB block size gives 8 splits.
2) The size of the file. With a 128 MB block size, a 128 MB input file produces 1 split, while a 256 MB input file produces 2 splits.
3) The number of files. FileInputFormat generates splits on a per-file basis and only divides large files, i.e. those larger than the HDFS block size. If dfs.block.size is set to 64 MB and there are 100 files in the input directory, the number of splits after division will be at least 100.
4) The splitsize. Files are divided into splits according to the splitsize. If no split size is set, it defaults to the HDFS block size, but an application can adjust the splitsize through two parameters. The number of mappers is calculated as follows:
Step 1: splitsize = max(minimumsize, min(maximumsize, blocksize)). If minimumsize and maximumsize are not set, splitsize defaults to blocksize.
Step 2: the calculation can be simplified to the following pseudocode; for the detailed algorithm, refer to the getSplits method of the FileInputFormat class.
total_split = 0;
for (file : each file in the input directory) {
    file_split = 1;
    if (file.size > splitsize) {
        file_split = file.size / splitsize;
    }
    total_split += file_split;
}
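For example, taking the 1024 MB input file and 256 MB block from point 1, and assuming minimumsize and maximumsize are left at their defaults: splitsize = max(minimumsize, min(maximumsize, 256 MB)) = 256 MB, and file_split = 1024 MB / 256 MB = 4, which matches the 4 splits mentioned above.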
How the number of reducers in MapReduce is determined: 1. By default, a MapReduce job has only one reducer; on a large cluster many reducers are needed, since if all the intermediate data is processed by too few reducers, the reduce stage becomes a computational bottleneck. 2. The optimal number of reducers is related to the number of reduce task slots available in the cluster; generally the number of reducers is set slightly below the total number of slots. The Hadoop documentation recommends two formulas:
0.95 * NUMBER_OF_NODES * mapred.tasktracker.reduce.tasks.maximum
1.75 * NUMBER_OF_NODES * mapred.tasktracker.reduce.tasks.maximum
Remarks: NUMBER_OF_NODES is the number of compute nodes in the cluster; mapred.tasktracker.reduce.tasks.maximum is the number of reduce task slots allocated on each node (usually the number of cores per node).
3. In code, the number of reducers can be set with the JobConf.setNumReduceTasks(int numOfReduceTasks) method.
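As a worked example (the cluster size here is purely illustrative): with NUMBER_OF_NODES = 10 and mapred.tasktracker.reduce.tasks.maximum = 4, the two formulas give 0.95 * 10 * 4 = 38 and 1.75 * 10 * 4 = 70 reducers, and the chosen value would then be applied with JobConf.setNumReduceTasks(38).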
Hive job parameter configuration and control of the number of map/reduce tasks
Configure the following performance tuning items in hive/conf/hive-site.xml:
Turn on dynamic partitioning: hive.exec.dynamic.partition=true
Default: false
Description: Whether to allow dynamic partitioning.
hive.exec.dynamic.partition.mode=nonstrict
Default: strict
Description: strict mode prevents all partition columns from being dynamic; at least one partition column must be given a static value. (A partition does not have to be specified when reading a table.)
hive.exec.max.dynamic.partitions.pernode=100
Default: 100
Description: The maximum number of dynamic partitions each mapper or reducer may create.
hive.exec.max.dynamic.partitions=1000
Default: 1000
Description: The maximum number of dynamic partitions a single DML statement may create.
hive.exec.max.created.files=100000
Default: 100000
Description: The maximum number of files a DML statement may create.
The following parameters can be set to lift some of these restrictions (Hive 0.7 does not have this limitation):
hive.merge.mapfiles=false
Default: true
Description: Whether to merge the output files of the map stage, i.e. combine the small files produced by the maps.
hive.merge.mapredfiles=false
Default: false
Description: Whether to merge the output files of the reduce stage, i.e. run an extra merge pass on the reduce output before it is written out.
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
Description: Merge small input files before execution.
Hive's local mode can be enabled with the following parameter: hive.exec.mode.local.auto=true (default is false). Since version 0.7, Hive supports choosing local mode for task execution, so that jobs over small amounts of data can run locally, which is much faster than submitting them to the cluster.
mapred.reduce.tasks: sets the number of reduce tasks for the current session; the default value is -1, which lets the system determine it automatically.
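As an illustration of the dynamic-partition settings above (the table and column names here are hypothetical), a typical usage looks like this:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- pt is fully dynamic, so nonstrict mode is required; each row's pt value decides which partition it is written to
insert overwrite table target_tbl partition (pt)
select col1, col2, pt from source_tbl;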
First, control the number of map tasks of a Hive job:
1. Normally, a job generates one or more map tasks from the input directory. The main determining factors are: the total number of input files, the size of the input files, and the file block size configured for the cluster (the value of dfs.block.size in hadoop/hdfs-site.xml; it can be viewed in Hive with the set dfs.block.size; command, but this parameter cannot be customized from Hive);
2. Example:
a) Assume the input directory contains a file a of 780 MB. Hadoop will divide file a into 7 blocks (six 128 MB blocks and one 12 MB block), resulting in 7 maps.
b) Assume the input directory contains 3 files a, b, and c of 10 MB, 20 MB, and 130 MB. Hadoop will divide them into 4 blocks (10 MB, 20 MB, 128 MB, and 2 MB), resulting in 4 maps.
That is, if a file is larger than the block size (128 MB) it will be split, and if it is smaller than the block size it will be treated as a single block.
3. Is more maps always better? No. If a task has many small files (far smaller than the 128 MB block size), each small file is treated as a block and handled by its own map task, and the time spent starting and initializing a map task can far exceed its actual processing time, wasting a lot of resources. Moreover, the number of maps that can execute simultaneously is limited.
4. Is it guaranteed to be enough that each map handles a file block of close to 128 MB?
Not necessarily. For example, a 127 MB file will be handled by a single map, but if that file has only one or two small fields yet tens of millions of records, and the map's processing logic is fairly complex, doing it all with one map task is still time consuming.
Problems 3 and 4 above are addressed in two opposite ways: reducing the number of maps and increasing the number of maps.
How to merge small files and reduce the number of maps?
Suppose a SQL task:
select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
The task's input directory /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04 contains 194 files in total, many of them small files far smaller than 128 MB, with a total size of about 9 GB. Normal execution would use 194 map tasks, and the total computing resources consumed by the map stage would be SLOTS_MILLIS_MAPS = 623,020. Use the following settings to merge small files before the maps are launched, reducing the number of maps:
set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
Executing the same statement then uses 74 map tasks, and the computing resources consumed by the map stage become:
SLOTS_MILLIS_MAPS= 333,500
For this simple SQL task the execution time may be similar, but it saves about half the computing resources. A brief explanation: 100000000 means roughly 100 MB, and
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
indicates that small files are merged before execution. The first three parameters determine the size of the merged splits: anything larger than the 128 MB block size is cut at 128 MB; pieces smaller than 128 MB but larger than 100 MB are cut at 100 MB; and pieces smaller than 100 MB (including small files and the remainders left over from large files) are merged together, finally producing 74 splits.
How to increase the number of maps appropriately? When the input files are large, the task logic is complex, and the maps run very slowly, you can consider increasing the number of maps so that each map processes less data, improving the task's execution efficiency. Suppose there is a task like this:
select data_desc,
count(1),
count(distinct id),
sum(case when ...), sum(case when ...),
sum(...) from a group by data_desc
If table a has only one file of 120 MB but it contains tens of millions of records, completing this task with a single map is certainly time consuming. In this case, we should consider splitting this file into several reasonably sized pieces so that multiple map tasks can handle it.
set mapred.reduce.tasks=10;
create table a_1 as select * from a distribute by rand(123);
This randomly distributes the records of table a into a table a_1 containing 10 files; using a_1 in place of a in the SQL above, the job then completes with 10 map tasks, each handling a bit more than 12 MB of data (several million records), which is much more efficient. The two techniques may look contradictory: one merges small files, the other splits a large file into smaller ones. The key point is that, according to the actual situation, the number of maps should follow two principles: make large data volumes use an appropriate number of maps, and let a single map task process an appropriate amount of data.
Second, control the number of reduce tasks of a Hive job: 1. How Hive determines the number of reducers: The number of reducers greatly affects task execution efficiency. Without an explicit setting (mapred.reduce.tasks = -1), Hive estimates the number of reducers based on the following two settings: hive.exec.reducers.bytes.per.reducer (the amount of data processed by each reduce task, default 1000^3 = 1 GB) and hive.exec.reducers.max (the maximum number of reducers per job, default 999). The formula for the number of reducers is simply N = min(parameter 2, total input data / parameter 1). In other words, if the total reduce input (the map output) does not exceed 1 GB, there will be only one reduce task. For example: select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; the directory /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04 totals more than 9 GB, so this statement uses 10 reducers.
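Applying the formula to this example: N = min(999, a little over 9 GB / 1 GB, rounded up) = min(999, 10) = 10 reducers.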
2. Method one for adjusting the number of reducers: adjust the value of the hive.exec.reducers.bytes.per.reducer parameter;
set hive.exec.reducers.bytes.per.reducer=500000000; (500 MB) select pt,count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; this time there are 20 reducers. 3. Method two for adjusting the number of reducers:
set mapred.reduce.tasks = 15; select pt,count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; this time there are 15 reducers.
4. More reducers is not always better; as with maps, starting and initializing reducers also consumes time and resources. In addition, there are as many output files as there are reducers, so if many small files are generated and then used as input to the next task, the problem of too many small files appears again;
5. Under what circumstances is there only one reducer? You will often find that, regardless of the amount of data in the task and regardless of whether you have set the parameters to adjust the number of reducers, the task always has only one reduce task. Besides the case where the data volume is smaller than the hive.exec.reducers.bytes.per.reducer value, the reasons are:
a) Aggregation without a group by, e.g. writing select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
as select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
b) Use of order by
c) Cartesian products
In these cases, apart from finding ways to work around and avoid them, there is usually no good solution, because these operations are global, so Hadoop has to complete them with a single reducer;
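For example (an illustrative query against the same table), even if mapred.reduce.tasks is raised, a global sort like the following still runs with a single reduce task:
select * from popt_tbaccountcopy_mes where pt = '2012-07-04' order by pt;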
Similarly, when setting the number of reducers you need to follow the same two principles: make large data volumes use an appropriate number of reducers, and let a single reduce task process an appropriate amount of data. The hive.exec.parallel parameter controls whether different jobs within the same SQL statement can run at the same time; the default is false.
The following is the test procedure for this parameter:
Test SQL: select r1.a from (select t.a from sunwg_10 t join sunwg_10000000 s on t.a=s.b) r1 join (select s.b from sunwg_100000 t join sunwg_10 s on t.a=s.b) r2 on (r1.a=r2.b);
When the parameter is false (set hive.exec.parallel=false), the three jobs are executed sequentially; but since the SQL in the two subqueries is unrelated, they can be run in parallel with set hive.exec.parallel=true. Summary: when resources are sufficient, hive.exec.parallel makes SQL that contains independent jobs run faster, but it also consumes more resources at the same time, so evaluate whether hive.exec.parallel is helpful for our refresh tasks.
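A minimal way to try this out (hive.exec.parallel.thread.number is the companion setting that limits how many jobs may run concurrently; the value 8 is shown only as an illustration):
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
-- then run the test SQL above; the two independent subquery jobs can be submitted at the same time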