
BIG DATA PROGRAMMING PROJECT: ATMOSPHERIC SCIENCE AND CLIMATE MODEL DATA

STUDENT

COURSE

INSTITUTION
Introduction

The goal of this report is to apply big data analysis techniques to atmospheric science and climate model data provided by the European Centre for Medium-Range Weather Forecasts (ECMWF), reducing the time required to analyze this massive dataset to an acceptable range. The data will be used to display the ozone details of each region in two dimensions. The investigation centers on the "Total Ozone Column" and uses MATLAB to examine a three-dimensional grid of atmospheric composition over Europe drawn from numerous climate models, with one hundred different chemical species stored at each grid site.

The entire dataset is larger than ten terabytes. The report will compare sequential and parallel techniques, discuss common data errors encountered when using MATLAB for large-scale data analysis, and suggest the number of processors needed to achieve the goal of analyzing the data within two hours per day (25 hours of data, approximately 250 MB).

The study begins with a comparison of the sequential and parallel techniques available for analyzing big data, emphasizing the benefits and drawbacks of each method. It then discusses the typical problems that arise when using MATLAB for large-scale data analysis, such as data inaccuracies, improper data types, and inappropriate data formats. Finally, it recommends how many processors are needed to achieve the specified efficiency level.

In summary, this study outlines the steps required to cut the time needed to analyze vast amounts of data down to an acceptable range. It compares sequential and parallel methodologies, explains common problems observed when using MATLAB for large-scale data analysis, and recommends the number of processors required to attain the desired level of productivity. With suitable processing methods, the efficiency of the analysis can be significantly increased and the target efficiency achieved.

Overcoming loading nearly 9 TB into memory

The dataset comprises roughly nine terabytes, far more than a computer's main memory can hold. We therefore used MATLAB's built-in mathematical tools and limited each analysis to one hour's worth of data at a time. This made it possible to carry out the research on a personal computer, which made the approach more flexible.
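
As a rough illustration of this one-hour-at-a-time approach, the sketch below reads a single time step from a NetCDF file with MATLAB's ncread. The file name 'analysis_data.nc', the variable name 'total_ozone_column', and the dimension layout are assumptions for illustration, not the project's actual identifiers.

```matlab
% A sketch, not the project's real code: file and variable names are
% placeholders for the actual ECMWF data files.
filename = 'analysis_data.nc';

info = ncinfo(filename);            % inspect the file before reading
disp({info.Variables.Name});        % list the variables it contains

% Assume the variable is laid out as [lon x lat x time]; read the whole
% grid but only the first time step, i.e. one hour of data.
start = [1 1 1];                    % first index along each dimension
count = [Inf Inf 1];                % Inf = entire dimension, 1 = one hour
ozoneHour = ncread(filename, 'total_ozone_column', start, count);
```

Because ncread pulls only the requested hyperslab from disk, memory use stays proportional to one hour of data rather than the full archive.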

Sequential processing compared to parallel processing

Traditional approaches to data analysis use sequential methods because of hardware constraints such as the single-core CPU (Reed and Dongarra, 2015). For this analysis, the sequential method serves as the baseline: it executes each instruction in the specified order and guarantees that there is only one active context at any given time during the program's run. When numerous jobs need to be done, they are processed one after another, which reduces efficiency. Utilizing parallel approaches is one way to reduce the time spent analyzing data: several activities are completed simultaneously instead of in the order they were originally assigned, increasing speed and productivity.


The sequential code is built so that the speed and efficiency of sequential and parallel operations can be compared. It evaluates the data sequentially according to the parameters the user entered. The code records the start and end times of each stage of the analysis and executes on a single core, giving the user a running estimate of the time for the current stage as well as for the analysis of all the data. The time spent on the currently active stage is displayed along with an estimate of the total time required for the entire run; the screenshot illustrates an example of this output.
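
A minimal sketch of this timing scheme is shown below. The SVD workload stands in for the real per-stage analysis, and the stage count is illustrative rather than taken from the project.

```matlab
% A sketch of the sequential timing scheme: the SVD below is only a
% stand-in for the real per-stage analysis; the stage count is illustrative.
nStages    = 24;                         % e.g. one stage per hour of data
stageTimes = zeros(1, nStages);

totalTimer = tic;
for k = 1:nStages
    stageTimer = tic;
    s = svd(rand(2000));                 % placeholder workload
    stageTimes(k) = toc(stageTimer);
    elapsed   = toc(totalTimer);
    estimated = elapsed / k * nStages;   % running estimate for the full run
    fprintf('Stage %2d: %.2f s (elapsed %.1f s, estimated total %.1f s)\n', ...
            k, stageTimes(k), elapsed, estimated);
end
```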

Parallel processing

With the development of multi-core CPUs, parallel processing has become an increasingly popular choice. A multiprocessor enables numerous tasks to be executed in parallel; this simultaneous execution is called parallelism. Performing multiple tasks at the same time both speeds up processing and saves time. To apply the parallel technique, code that supports parallel processing must first be developed. By specifying the number of cores per run and the number of data sets to be analyzed in a given day, this code helps determine the most effective method of data processing. Parallel processing lets users execute tasks more efficiently by optimizing processing time and speeding up the work.
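
The sketch below outlines how such a parallel run might look in MATLAB using parpool and parfor (which require the Parallel Computing Toolbox). The core count and the stand-in workload are assumptions for illustration.

```matlab
% A sketch of the parallel version; requires the Parallel Computing Toolbox.
% The core count and the SVD workload are illustrative placeholders.
nCores = 4;
if isempty(gcp('nocreate'))
    parpool('local', nCores);            % start a pool of nCores workers
end

nTasks    = 24;
taskTimes = zeros(1, nTasks);
overall   = tic;
parfor k = 1:nTasks
    t = tic;
    s = svd(rand(2000));                 % stand-in for the real analysis
    taskTimes(k) = toc(t);
end
fprintf('Parallel wall-clock time: %.2f s\n', toc(overall));
delete(gcp('nocreate'));                 % release the workers
```

Because the iterations are independent, parfor distributes them across the pool's workers, and the wall-clock time shrinks as cores are added.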

Sequential and parallel outcome testing

During testing, we found that parallel processing is more efficient than sequential processing. Two data sets were used: one with 250 records and one with 5000 records. For the 250-record set, sequential processing (a single core) took 8.72 seconds; because the test CPU has only six cores, parallel processing with two to six cores took 7.24, 5.58, 5.06, 4.92, and 4.73 seconds, respectively. Using several processor cores produces a discernible increase in speed. For the 5000-record set, sequential processing on a single core took 168.04 seconds, while parallel processing with two to six cores took 133.08, 94.57, 76.93, 66.34, and 60.11 seconds, respectively. Each multi-core run was much faster than the single-core run, further evidence of the benefits of parallel processing.

A time comparison chart was generated with plotting software to illustrate this result in more depth. The chart shows that each multi-core run took much less time than the single-core run, with the improvement most noticeable in the 5000-record set. This exemplifies the advantages of parallel processing, which dramatically reduces the time required to process both data sets. Overall, the testing demonstrates that parallel processing is noticeably more effective than sequential processing and can be applied to almost any task to increase its speed and productivity.

Data set       1 core     2 cores    3 cores    4 cores    5 cores    6 cores
250 records    8.72 s     7.24 s     5.58 s     5.06 s     4.92 s     4.73 s
5000 records   168.04 s   133.08 s   94.57 s    76.93 s    66.34 s    60.11 s

When completing the same tasks, it is clear that working in parallel is more productive than working sequentially. The line graph shows that the time needed to accomplish the task shrinks as the number of cores in the central processing unit (CPU) increases, suggesting that parallelism is preferable to sequential processing in almost every respect.
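
For reference, a chart like the one described here can be reproduced with MATLAB's plotting functions using the measured times from the table above; the styling choices are arbitrary.

```matlab
% The measured times from the table above, plotted against core count.
cores = 1:6;
t250  = [8.72 7.24 5.58 5.06 4.92 4.73];            % seconds, 250 records
t5000 = [168.04 133.08 94.57 76.93 66.34 60.11];    % seconds, 5000 records

figure;
plot(cores, t250, '-o', cores, t5000, '-s');
xlabel('Number of CPU cores');
ylabel('Processing time (s)');
legend('250 records', '5000 records');
title('Sequential (1 core) versus parallel (2-6 cores) run times');
grid on;
```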


Testing Code

Three components need to be evaluated to establish whether the parallel strategy is successful: the code, the data, and the test code. Running the code through its paces and verifying the outcomes ensures that it functions as intended. Examining the data verifies that the data being used is accurate and valid. Testing the test code guarantees that it runs without problems and properly reflects the program's outcomes.

Text error

Testing the software for text errors is vital to ensuring the program's quality. The CreateTestDataText script was used to build the test file test.nc, which was then used to check the accuracy of the content covered in this report. To carry out the test, the test file was extracted from the source data file, the data types were analyzed, and the results were saved in an array. The test was a success, and the outcomes matched what was anticipated, as demonstrated in the screenshots below.
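
A minimal sketch of such a text test is given below, assuming hypothetical variable names and reference values. MATLAB's nccreate, ncwrite, and ncread build the file, read it back, and compare both type and contents.

```matlab
% A sketch of the text test with placeholder names: write known values to a
% small NetCDF file, read them back, and verify the type and contents.
testFile = 'test.nc';
if isfile(testFile), delete(testFile); end

expected = single(magic(4));             % known reference values
nccreate(testFile, 'ozone', 'Dimensions', {'x', 4, 'y', 4}, ...
         'Datatype', 'single');
ncwrite(testFile, 'ozone', expected);

actual = ncread(testFile, 'ozone');
assert(isa(actual, 'single'), 'Unexpected data type: %s', class(actual));
assert(isequal(actual, expected), 'Stored values do not match');
disp('Text test passed: data type and values match expectations.');
```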

NaN test

An error that frequently occurs in MATLAB is the not-a-number (NaN) error. Mathematical operations such as 0/0, Inf/Inf, Inf-Inf, and Inf*0 are all undefined, and each produces a NaN value. To check for NaN problems, a test dataset is constructed by running the CreateTestDataNaN.m script on the test.nc file containing the analysis data, and the actual results are then compared with the expected ones. However, because NaN values can appear anywhere in the file, it is essential to employ conditional checks and other approaches to pinpoint the position of the problem; doing so saves time.
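
A minimal sketch of such a NaN check, using MATLAB's isnan with conditional logic to report the positions of any NaN entries, is shown below; the file and variable names are placeholders.

```matlab
% A sketch of the NaN check with placeholder names: flag any NaN entries
% and report where they sit in the grid.
data = ncread('test.nc', 'ozone');       % values under test

nanMask = isnan(data);
if any(nanMask(:))
    idx = find(nanMask);
    [row, col] = ind2sub(size(data), idx);
    fprintf('Found %d NaN value(s), the first at (row %d, col %d)\n', ...
            numel(idx), row(1), col(1));
else
    disp('NaN test passed: no NaN values present.');
end
```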


Log file

During the testing phase, maintaining a log file makes it simpler to localize mistakes. An error, misplacement, or other problem can be localized with the help of TestSolutionWithLogFile, which annotates the hour ("xx" hours) at which it occurred in the log file. The accompanying test output demonstrates that this solution is effective.
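
One possible shape for such a log entry is sketched below, assuming a hypothetical log file name and error hour; each detected problem is appended with a timestamp so it can be localized later.

```matlab
% A sketch of the logging idea with placeholder names: append each problem
% with a timestamp and the hour at which it occurred.
logFile = 'analysis_errors.log';
hourOfError = 14;                        % the "xx" hour where a fault was found
fid = fopen(logFile, 'a');               % open the log for appending
fprintf(fid, '%s  NaN detected in hour %02d of the input data\n', ...
        datestr(now, 'yyyy-mm-dd HH:MM:SS'), hourOfError);
fclose(fid);
```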

Automated testing

Introduction to automated tests

Thanks to the code that was built, parallel processing can now be used to analyze the data requested by the client. To provide consumers with a "one-stop service," a main program combining all features was developed. It provides line graphs comparing the efficiency of parallel and sequential processing, text and NaN tests for the test data and code, and error logs to help clients find and resolve problems. Users can also rapidly alter the target settings and run the application with a single click, while all data is preserved for easy review and extraction.
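
A minimal sketch of such a driver script is given below. It assumes the test.nc file from the earlier text-test sketch and collapses the individual checks into inline steps, whereas the real application would call its own test routines.

```matlab
% A sketch of a single "run everything" driver with placeholder names; it
% reuses the test.nc file built in the earlier text-test sketch.
testFile = 'test.nc';
results  = struct();

data = ncread(testFile, 'ozone');
results.textTestPassed = isa(data, 'single');        % type check
results.nanTestPassed  = ~any(isnan(data(:)));       % NaN check

save('automated_test_results.mat', 'results');       % keep for later review
fprintf('Text test: %d, NaN test: %d (1 = passed)\n', ...
        results.textTestPassed, results.nanTestPassed);
```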

Results of automated tests

All data was successfully shown in the workbench, and all test results were produced as expected.

Estimating the number of processors

Preliminary Prediction of Outcome

Testing that compared parallel and sequential methods on the same data showed that the time required to process the data decreased as the number of processors rose. As the line graph shows, the test results are nearly linear.

Estimates of processors using functions


Saving the test results allows the data to be extrapolated; MATLAB's plotting functions can then be used to create a precise line chart displaying the outcomes of parallel processing. On this basis, it was estimated that at least 13 cores would be required to achieve the desired result.
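
A sketch of this extrapolation is shown below, fitting a straight line to the 5000-record timings with polyfit and solving for the core count that meets a target time. The target value is hypothetical; the report's own extrapolation arrived at roughly 13 cores.

```matlab
% A sketch of the linear extrapolation; the target time is hypothetical.
cores = 1:6;
t5000 = [168.04 133.08 94.57 76.93 66.34 60.11];     % measured seconds

p = polyfit(cores, t5000, 1);            % linear fit: t = p(1)*n + p(2)
targetTime = 45;                         % hypothetical per-run target (s)
nNeeded = ceil((targetTime - p(2)) / p(1));
fprintf('Linear estimate: %d cores for a %.0f s target\n', nNeeded, targetTime);

% Visual check of the fit and the extrapolation.
nGrid = 1:16;
figure;
plot(cores, t5000, 'o', nGrid, polyval(p, nGrid), '-');
xlabel('Cores'); ylabel('Time (s)'); grid on;
```

A straight-line fit, however, eventually predicts zero or even negative processing times as the core count grows, which is one reason to prefer the non-linear model discussed next.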

More accurate prediction of processor cores

In practice, advances in computing power do not scale linearly with core count. For multi-core computers, packaging constraints limit how far memory bandwidth can be expanded, resulting in non-linear performance growth. Furthermore, the operating system's thread scheduling and switching between CPU cores can also make multi-core performance grow non-linearly. An exponential regression function, rather than a linear one, may therefore yield more reliable estimates of the number of target cores (Xu and Duan, 2019).
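
As one possible non-linear alternative, the sketch below fits a power-law model t = a * n^b in log space, which stays positive under extrapolation; the target time is again hypothetical.

```matlab
% A sketch of a non-linear alternative: fit t = a * n^b in log space, which
% stays positive under extrapolation. The target time is again hypothetical.
cores = 1:6;
t5000 = [168.04 133.08 94.57 76.93 66.34 60.11];

q = polyfit(log(cores), log(t5000), 1);  % log(t) = q(1)*log(n) + q(2)
b = q(1);                                % scaling exponent (negative)
a = exp(q(2));                           % single-core time implied by the fit

targetTime = 45;                         % hypothetical target in seconds
nNeeded = ceil((targetTime / a)^(1 / b));
fprintf('Power-law estimate: %d cores for a %.0f s target\n', ...
        nNeeded, targetTime);
```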

Conclusion

This report compared sequential and parallel programming. The results show that parallel code outperforms sequential code in efficiency and in its ability to meet the client's needs. It was found that a minimum of 13 CPU cores would be required to achieve the goal. Furthermore, a system that performs a routine search for frequent errors was set up to quickly zero in on any problems and store their coordinates in a log for later scrutiny by the client. For the user's convenience, a single script containing all the functions and settings was created.

References

Reed, D.A. and Dongarra, J., 2015. Exascale computing and big data. Communications of the

ACM, 58(7), pp.56-68.

Xu, L.D. and Duan, L., 2019. Big data for cyber-physical systems in industry 4.0: a survey.

Enterprise Information Systems, 13(2), pp.148-169.
