Big Data describes large volumes of data, both structured and unstructured. The term often
refers to the use of predictive analytics, user behavior analytics, and other advanced data
analytics methods that extract value from data, and seldom to a particular size of data set.
The challenges include data capture, storage, search, sharing, transfer, analysis, and
curation.
5) What are the ‘training set’ and ‘test set’ in a machine learning model? How much data
will you allocate for your training, validation, and test sets?
Training dataset – the labeled examples given to the model so it can learn relationships
between inputs and outputs from historical patterns in the data. Typically about 70% of the
total data is allocated to training.
Validation dataset – a portion sometimes held out from the training data to tune
hyperparameters and compare candidate models before final evaluation.
Test dataset – used to measure how well the trained model predicts on new, unseen data.
The remaining 30% is typically reserved for testing: the model makes predictions without
seeing the labels, and the results are then verified against them.
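The split described above can be sketched in plain Python. The 70/15/15 fractions below are illustrative defaults (the notes use a simpler 70/30 train/test split); the function name and parameters are hypothetical, not from any particular library.

```python
import random

def split_dataset(data, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle a dataset and split it into training, validation, and
    test sets. Fractions are illustrative; the remainder after the
    training and validation slices becomes the test set."""
    rng = random.Random(seed)      # fixed seed for a reproducible split
    shuffled = data[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling before splitting matters: if the data is ordered (for example, by date or class), a naive slice would give the model a training set that is not representative of the test set.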
9) Discuss the mechanism of blockchain and its application as a distributed ledger technology.
Blockchain is a type of digital ledger in which information is recorded sequentially within
blocks that are then linked or “chained” together and secured using cryptographic methods.
Each block contains a grouping of transactions (or entries) and a secure link (known as a
hash) to the previous block. New transactions are inserted into the chain only after
validation via a consensus mechanism, in which authorized members agree on the
transaction and on the order, or history, in which previous transactions occurred.
The consensus mechanism used to verify a transaction includes a cryptographic problem
that must be solved by some computers on the network (known as miners) each time a
transaction takes place. The process to update the blockchain can require substantial
amounts of computing power, making it very difficult and extremely expensive for an
individual third party to manipulate historical data.
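The hash-chaining and mining ideas above can be sketched with a toy example. This is a minimal illustration, not a real blockchain: the `mine_block` "puzzle" (find a nonce giving a hash with leading zeros) stands in for proof-of-work, and the function names are hypothetical.

```python
import hashlib
import json

def block_hash(block):
    """Hash every field of a block except its own stored hash."""
    payload = {k: v for k, v in block.items() if k != "hash"}
    data = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(data).hexdigest()

def mine_block(transactions, prev_hash, difficulty=2):
    """Toy proof-of-work: try nonces until the block's hash starts with
    `difficulty` zero hex digits (stand-in for the miners' puzzle)."""
    nonce = 0
    while True:
        block = {"transactions": transactions,
                 "prev_hash": prev_hash,
                 "nonce": nonce}
        h = block_hash(block)
        if h.startswith("0" * difficulty):
            block["hash"] = h
            return block
        nonce += 1

def verify_chain(chain):
    """Check each block's own hash and its link to the previous block."""
    for i, block in enumerate(chain):
        if block_hash(block) != block["hash"]:
            return False          # block contents were altered
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False          # link to previous block is broken
    return True

genesis = mine_block(["genesis"], "0" * 64)
chain = [genesis, mine_block(["Alice pays Bob 5"], genesis["hash"])]
print(verify_chain(chain))             # True
chain[0]["transactions"] = ["forged"]  # tamper with history
print(verify_chain(chain))             # False
```

The final two lines show why manipulating historical data is hard: changing any past block alters its hash, which breaks the `prev_hash` link stored in every block after it, so an attacker would have to redo the proof-of-work for the entire remainder of the chain.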