
BIG DATA PROGRAMMING PROJECT: ATMOSPHERIC SCIENCE AND CLIMATE MODEL DATA

STUDENT

COURSE

INSTITUTION
Introduction

The goal of this report is to apply big data analysis techniques to atmospheric science and climate model data provided by the European Centre for Medium-Range Weather Forecasts (ECMWF), reducing the time required to analyze this massive dataset to an acceptable range. The data will be used to display the ozone details of each region in two dimensions. The investigation centers on the "Total Ozone Column" and uses MATLAB to examine a three-dimensional grid of atmospheric composition over Europe drawn from numerous climate models, with one hundred different chemical species stored at each grid site.

The entire dataset is larger than ten terabytes. The report will compare sequential and parallel techniques, discuss common data errors encountered when using MATLAB for large-scale data analysis, and suggest the number of processors needed to achieve the goal of analyzing the data within two hours per day (25 hours of data, approximately 250 MB).

The study begins with a comparison of the sequential and parallel techniques available for analyzing big data, emphasizing the benefits and drawbacks of each method. It then discusses the typical problems that arise when using MATLAB for large-scale data analysis, such as data inaccuracies, improper data types, and inappropriate data formats. Finally, it recommends how many processors are needed to achieve the specified efficiency level.

In summary, this study outlines the steps required to cut the time needed to analyze vast amounts of data down to an acceptable range. It compares sequential and parallel methodologies, explains common problems observed when using MATLAB for large-scale data analysis, and recommends the number of processors required to attain the desired level of productivity. With suitable processing methods, the efficiency of the analysis can be significantly increased and the target efficiency achieved.

Overcoming loading nearly 9 TB into memory

The dataset comprises roughly nine terabytes, far more than a computer's main memory can hold. We therefore used MATLAB's built-in mathematical tools and limited each analysis to one hour's worth of data at a time. This made it possible to carry out the research on a personal computer, which made the approach more flexible.
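
As a rough illustration of this one-hour-at-a-time approach, the sketch below reads a single time step from a NetCDF file with MATLAB's ncread. The file name 'analysis_data.nc', the variable name 'total_ozone_column', and the dimension layout are assumptions for illustration, not the project's actual identifiers.

```matlab
% A sketch, not the project's real code: file and variable names are
% placeholders for the actual ECMWF data files.
filename = 'analysis_data.nc';

info = ncinfo(filename);            % inspect the file before reading
disp({info.Variables.Name});        % list the variables it contains

% Assume the variable is laid out as [lon x lat x time]; read the whole
% grid but only the first time step, i.e. one hour of data.
start = [1 1 1];                    % first index along each dimension
count = [Inf Inf 1];                % Inf = entire dimension, 1 = one hour
ozoneHour = ncread(filename, 'total_ozone_column', start, count);
```

Because ncread pulls only the requested hyperslab from disk, memory use stays proportional to one hour of data rather than the full archive.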

Sequential processing compared to parallel processing

Traditional approaches to data analysis use sequential methods because of hardware constraints such as the single-core CPU (Reed and Dongarra, 2015). For this analysis, the sequential method serves as the baseline: it executes each instruction in the specified order and guarantees that there is only one active context at any given time during the program's run. When numerous jobs need to be done, they are processed one after another, which reduces efficiency. Utilizing parallel approaches is one way to reduce the time spent analyzing data: several activities are completed simultaneously instead of in the order they were originally assigned, increasing speed and productivity.


The sequential code is built so that the speed and efficiency of sequential and parallel operations can be compared. It evaluates the data sequentially according to the parameters the user entered. The code records the start and end times of each stage of the analysis and executes on a single core, giving the user a running estimate of the time for the current stage as well as for the analysis of all the data. The time spent on the currently active stage is displayed along with an estimate of the total time required for the entire run; the screenshot illustrates an example of this output.
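
A minimal sketch of this timing scheme is shown below. The SVD workload stands in for the real per-stage analysis, and the stage count is illustrative rather than taken from the project.

```matlab
% A sketch of the sequential timing scheme: the SVD below is only a
% stand-in for the real per-stage analysis; the stage count is illustrative.
nStages    = 24;                         % e.g. one stage per hour of data
stageTimes = zeros(1, nStages);

totalTimer = tic;
for k = 1:nStages
    stageTimer = tic;
    s = svd(rand(2000));                 % placeholder workload
    stageTimes(k) = toc(stageTimer);
    elapsed   = toc(totalTimer);
    estimated = elapsed / k * nStages;   % running estimate for the full run
    fprintf('Stage %2d: %.2f s (elapsed %.1f s, estimated total %.1f s)\n', ...
            k, stageTimes(k), elapsed, estimated);
end
```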

Parallel processing

With the development of multi-core CPUs, parallel processing has become an increasingly popular choice. A multiprocessor enables numerous tasks to be executed in parallel; this simultaneous execution is called parallelism. Performing multiple tasks at the same time both speeds up processing and saves time. To apply the parallel technique, code that supports parallel processing must first be developed. By specifying the number of cores per run and the number of data sets to be analyzed in a given day, this code helps determine the most effective method of data processing. Parallel processing lets users execute tasks more efficiently by optimizing processing time and speeding up the work.
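
The sketch below outlines how such a parallel run might look in MATLAB using parpool and parfor (which require the Parallel Computing Toolbox). The core count and the stand-in workload are assumptions for illustration.

```matlab
% A sketch of the parallel version; requires the Parallel Computing Toolbox.
% The core count and the SVD workload are illustrative placeholders.
nCores = 4;
if isempty(gcp('nocreate'))
    parpool('local', nCores);            % start a pool of nCores workers
end

nTasks    = 24;
taskTimes = zeros(1, nTasks);
overall   = tic;
parfor k = 1:nTasks
    t = tic;
    s = svd(rand(2000));                 % stand-in for the real analysis
    taskTimes(k) = toc(t);
end
fprintf('Parallel wall-clock time: %.2f s\n', toc(overall));
delete(gcp('nocreate'));                 % release the workers
```

Because the iterations are independent, parfor distributes them across the pool's workers, and the wall-clock time shrinks as cores are added.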

Sequential and parallel outcome testing

During testing, we found that parallel processing is more efficient than sequential processing. Two data sets were used: one with 250 records and one with 5000 records. For the 250-record set, sequential processing (a single core) took 8.72 seconds; because the test CPU has only six cores, parallel processing with two to six cores took 7.24, 5.58, 5.06, 4.92, and 4.73 seconds, respectively. Using several processor cores produces a discernible increase in speed. For the 5000-record set, sequential processing on a single core took 168.04 seconds, while parallel processing with two to six cores took 133.08, 94.57, 76.93, 66.34, and 60.11 seconds, respectively. Each multi-core run was much faster than the single-core run, further evidence of the benefits of parallel processing.

A time comparison chart was generated with plotting software to illustrate this result in more depth. The chart shows that each multi-core run took much less time than the single-core run, with the improvement most noticeable in the 5000-record set. This exemplifies the advantages of parallel processing, which dramatically reduces the time required to process both data sets. Overall, the testing demonstrates that parallel processing is noticeably more effective than sequential processing and can be applied to almost any task to increase its speed and productivity.

Data set       1 core     2 cores    3 cores    4 cores    5 cores    6 cores
250 records    8.72 s     7.24 s     5.58 s     5.06 s     4.92 s     4.73 s
5000 records   168.04 s   133.08 s   94.57 s    76.93 s    66.34 s    60.11 s

When completing the same tasks, it is clear that working in parallel is more productive than working sequentially. The line graph shows that the time needed to accomplish the task shrinks as the number of cores in the central processing unit (CPU) increases, suggesting that parallelism is preferable to sequential processing in almost every respect.
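
For reference, a chart like the one described here can be reproduced with MATLAB's plotting functions using the measured times from the table above; the styling choices are arbitrary.

```matlab
% The measured times from the table above, plotted against core count.
cores = 1:6;
t250  = [8.72 7.24 5.58 5.06 4.92 4.73];            % seconds, 250 records
t5000 = [168.04 133.08 94.57 76.93 66.34 60.11];    % seconds, 5000 records

figure;
plot(cores, t250, '-o', cores, t5000, '-s');
xlabel('Number of CPU cores');
ylabel('Processing time (s)');
legend('250 records', '5000 records');
title('Sequential (1 core) versus parallel (2-6 cores) run times');
grid on;
```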


Testing Code

Three components need to be evaluated to establish whether the parallel strategy is successful: the code, the data, and the test code. Running the code through its paces and verifying the outcomes ensures that it functions as intended. Examining the data verifies that the data being used is accurate and valid. Testing the test code guarantees that it runs without problems and properly reflects the program's outcomes.

Text error

Testing the software for text errors is vital to ensuring the program's quality. The CreateTestDataText script was used to build the test file test.nc, which was then used to check the accuracy of the content covered in this report. To carry out the test, the test file was extracted from the source data file, the data types were analyzed, and the results were saved in an array. The test was a success, and the outcomes matched what was anticipated, as demonstrated in the screenshots below.
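
A minimal sketch of such a text test is given below, assuming hypothetical variable names and reference values. MATLAB's nccreate, ncwrite, and ncread build the file, read it back, and compare both type and contents.

```matlab
% A sketch of the text test with placeholder names: write known values to a
% small NetCDF file, read them back, and verify the type and contents.
testFile = 'test.nc';
if isfile(testFile), delete(testFile); end

expected = single(magic(4));             % known reference values
nccreate(testFile, 'ozone', 'Dimensions', {'x', 4, 'y', 4}, ...
         'Datatype', 'single');
ncwrite(testFile, 'ozone', expected);

actual = ncread(testFile, 'ozone');
assert(isa(actual, 'single'), 'Unexpected data type: %s', class(actual));
assert(isequal(actual, expected), 'Stored values do not match');
disp('Text test passed: data type and values match expectations.');
```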

NaN test

An error that frequently occurs in MATLAB is the not-a-number (NaN) error. Mathematical operations such as 0/0, Inf/Inf, Inf-Inf, and Inf*0 are all undefined, and each produces a NaN value. To check for NaN problems, a test dataset is constructed by running the CreateTestDataNaN.m script on the test.nc file containing the analysis data, and the actual results are then compared with the expected ones. However, because NaN values can appear anywhere in the file, it is essential to employ conditional checks and other approaches to pinpoint the position of the problem; doing so saves time.
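
A minimal sketch of such a NaN check, using MATLAB's isnan with conditional logic to report the positions of any NaN entries, is shown below; the file and variable names are placeholders.

```matlab
% A sketch of the NaN check with placeholder names: flag any NaN entries
% and report where they sit in the grid.
data = ncread('test.nc', 'ozone');       % values under test

nanMask = isnan(data);
if any(nanMask(:))
    idx = find(nanMask);
    [row, col] = ind2sub(size(data), idx);
    fprintf('Found %d NaN value(s), the first at (row %d, col %d)\n', ...
            numel(idx), row(1), col(1));
else
    disp('NaN test passed: no NaN values present.');
end
```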


Log file

During the testing phase, maintaining a log file makes it simpler to localize mistakes. An error, misplacement, or other problem can be localized with the help of TestSolutionWithLogFile, which annotates the hour ("xx" hours) at which it occurred in the log file. The accompanying test output demonstrates that this solution is effective.
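
One possible shape for such a log entry is sketched below, assuming a hypothetical log file name and error hour; each detected problem is appended with a timestamp so it can be localized later.

```matlab
% A sketch of the logging idea with placeholder names: append each problem
% with a timestamp and the hour at which it occurred.
logFile = 'analysis_errors.log';
hourOfError = 14;                        % the "xx" hour where a fault was found
fid = fopen(logFile, 'a');               % open the log for appending
fprintf(fid, '%s  NaN detected in hour %02d of the input data\n', ...
        datestr(now, 'yyyy-mm-dd HH:MM:SS'), hourOfError);
fclose(fid);
```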

Automated testing

Introduction to automated tests

Thanks to the code that was built, parallel processing can now be used to analyze the data requested by the client. To provide consumers with a "one-stop service," a main program combining all features was developed. It provides line graphs comparing the efficiency of parallel and sequential processing, text and NaN tests for the test data and code, and error logs to help clients find and resolve problems. Users can also rapidly alter the target settings and run the application with a single click, while all data is preserved for easy review and extraction.
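
A minimal sketch of such a driver script is given below. It assumes the test.nc file from the earlier text-test sketch and collapses the individual checks into inline steps, whereas the real application would call its own test routines.

```matlab
% A sketch of a single "run everything" driver with placeholder names; it
% reuses the test.nc file built in the earlier text-test sketch.
testFile = 'test.nc';
results  = struct();

data = ncread(testFile, 'ozone');
results.textTestPassed = isa(data, 'single');        % type check
results.nanTestPassed  = ~any(isnan(data(:)));       % NaN check

save('automated_test_results.mat', 'results');       % keep for later review
fprintf('Text test: %d, NaN test: %d (1 = passed)\n', ...
        results.textTestPassed, results.nanTestPassed);
```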

Results of automated tests

All data was successfully shown in the workbench, and all test results were produced as expected.

Estimating the number of processors

Preliminary Prediction of Outcome

Testing that compared parallel and sequential methods on the same data showed that the time required to process the data decreased as the number of processors rose. As the line graph shows, the test results are nearly linear.

Estimates of processors using functions


Saving the test results allows the data to be extrapolated; MATLAB's plotting functions can then be used to create a precise line chart displaying the outcomes of parallel processing. On this basis, it was estimated that at least 13 cores would be required to achieve the desired result.
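
A sketch of this extrapolation is shown below, fitting a straight line to the 5000-record timings with polyfit and solving for the core count that meets a target time. The target value is hypothetical; the report's own extrapolation arrived at roughly 13 cores.

```matlab
% A sketch of the linear extrapolation; the target time is hypothetical.
cores = 1:6;
t5000 = [168.04 133.08 94.57 76.93 66.34 60.11];     % measured seconds

p = polyfit(cores, t5000, 1);            % linear fit: t = p(1)*n + p(2)
targetTime = 45;                         % hypothetical per-run target (s)
nNeeded = ceil((targetTime - p(2)) / p(1));
fprintf('Linear estimate: %d cores for a %.0f s target\n', nNeeded, targetTime);

% Visual check of the fit and the extrapolation.
nGrid = 1:16;
figure;
plot(cores, t5000, 'o', nGrid, polyval(p, nGrid), '-');
xlabel('Cores'); ylabel('Time (s)'); grid on;
```

A straight-line fit, however, eventually predicts zero or even negative processing times as the core count grows, which is one reason to prefer the non-linear model discussed next.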

More accurate prediction of processor cores

In practice, advances in computing power do not scale linearly with core count. For multi-core computers, packaging constraints limit how far memory bandwidth can be expanded, resulting in non-linear performance growth. Furthermore, the operating system's thread scheduling and switching between CPU cores can also make multi-core performance grow non-linearly. An exponential regression function, rather than a linear one, may therefore yield more reliable estimates of the number of target cores (Xu and Duan, 2019).
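
As one possible non-linear alternative, the sketch below fits a power-law model t = a * n^b in log space, which stays positive under extrapolation; the target time is again hypothetical.

```matlab
% A sketch of a non-linear alternative: fit t = a * n^b in log space, which
% stays positive under extrapolation. The target time is again hypothetical.
cores = 1:6;
t5000 = [168.04 133.08 94.57 76.93 66.34 60.11];

q = polyfit(log(cores), log(t5000), 1);  % log(t) = q(1)*log(n) + q(2)
b = q(1);                                % scaling exponent (negative)
a = exp(q(2));                           % single-core time implied by the fit

targetTime = 45;                         % hypothetical target in seconds
nNeeded = ceil((targetTime / a)^(1 / b));
fprintf('Power-law estimate: %d cores for a %.0f s target\n', ...
        nNeeded, targetTime);
```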

Conclusion

This report compared sequential and parallel programming. The results show that parallel code outperforms sequential code in efficiency and in its ability to meet the client's needs. It was found that a minimum of 13 CPU cores would be required to achieve the goal. Furthermore, a system that performs a routine search for frequent errors was set up to quickly zero in on any problems and store their coordinates in a log for later scrutiny by the client. For the user's convenience, a single script containing all the functions and settings was created.

References

Reed, D.A. and Dongarra, J., 2015. Exascale computing and big data. Communications of the

ACM, 58(7), pp.56-68.

Xu, L.D. and Duan, L., 2019. Big data for cyber-physical systems in industry 4.0: a survey.

Enterprise Information Systems, 13(2), pp.148-169.
