
Department of Artificial Intelligence


22AIE101: Problem Solving & C
Programming
2023
Project Report By:
Sai Akhilesh Y CB.SC.U4AIE23328
Santosh K CB.SC.U4AIE23310

Srikar C CB.SC.U4AIE23355

DATA SAMPLING
Supervised By:
Dr. Vidhya Kamakshi V

Assistant Professor Department of

Artificial Intelligence
Amrita Vishwa Vidyapeetham


Amrita Vishwa Vidyapeetham, Coimbatore

CERTIFICATE

This is to certify that the project entitled, “DATA SAMPLING” submitted by


Group 1 in partial fulfilment of the requirements for the award of the Bachelor of
Technology Degree in Computer Science and Artificial Intelligence Engineering
at Amrita Vishwa Vidyapeetham, Coimbatore (Deemed University) is an
authentic work carried out by them under my supervision and guidance.

To the best of my knowledge, the matter embodied in the report

has not been submitted to any other University / Institute for the

award of any Degree or Diploma.

Date:

(Dr. Vidhya Kamakshi V)


Dept. of Artificial Intelligence Engineering

Amrita Vishwa Vidyapeetham,

Coimbatore – 641112, Tamil Nadu


CONTENTS
1. INTRODUCTION

2. PROJECT OVERVIEW

3. LITERATURE REVIEW

4. FINAL CODE & OUTPUT

5. FUTURE SCOPE

6. CONCLUSION

7. WORKLOAD DISTRIBUTION


1. INTRODUCTION

Data Sampling:

Definition: Data sampling is a statistical method of selecting a subset of


data from a larger dataset to make inferences about the population from
which the data is drawn.

Fig-1

Purpose of Data Sampling:

1. Resource Efficiency: Sampling allows researchers to analyse a


fraction of the entire dataset, saving time and resources compared
to analysing the entire population.


2. Feasibility: In cases where it's impractical to study the entire


population, sampling provides a manageable and practical way to
draw conclusions.
3. Inference: By studying a representative sample, one can make
generalizations and draw inferences about the characteristics of
the entire population.
4. Accuracy Assessment: Sampling helps assess the accuracy of
predictions and statistical analyses by providing insights into the
variability within the dataset.

Uses of Data Sampling:

1. Market Research: Sampling is widely used in market research to


understand consumer behaviour, preferences, and trends without
having to survey an entire population.
2. Quality Control: In manufacturing and production processes,
sampling is used to inspect a subset of products to ensure quality
standards are met.
3. Health Studies: Sampling is employed in medical research to
study the health of populations, test the effectiveness of
treatments, or assess the prevalence of diseases.
4. Opinion Polls: Political and social scientists often use sampling to
gauge public opinion by surveying a representative subset of the
population.


5. Financial Audits: Auditors use sampling to assess the accuracy of


financial statements and transactions without reviewing every
single transaction.
6. Environmental Studies: Sampling is used in environmental
research to study air, water, and soil quality by analysing samples
rather than the entire ecosystem.

In summary, data sampling is a powerful tool that allows researchers and


analysts to draw meaningful insights from a smaller subset of data,
providing a cost-effective and efficient way to make informed decisions
and predictions about larger populations.


2. PROJECT OVERVIEW

Based on the chance that each element of the population has of being included
in the sample, data sampling is of two types:
 Probability Sampling
 Non-Probability Sampling

Probability Sampling:

1. Simple Random Sampling:


 Description: Every individual in the population has an equal chance of
being selected.
 Procedure: Randomly selecting elements, often using random number
generators.
 Use Case: When the population is homogeneous and each member is
equally likely to be representative.

Fig-2
2. Stratified Sampling:
 Description: Divides the population into subgroups (strata) based on
specific characteristics, and then samples from each stratum.


 Procedure: Randomly selecting individuals from each stratum in
proportion to its size in the population.
 Use Case: Ensures representation from different segments of the
population.

Fig-3

3. Systematic Sampling:
 Description: Selects every kth element from a list after a random
starting point.
 Procedure: Choosing a random starting point and then selecting every
kth element.
 Use Case: When there is an ordered list and a systematic approach is
desired.


Fig-4
4. Cluster Sampling:
 Description: Divides the population into clusters, randomly selects
some clusters, and includes all members from those clusters in the
sample.
 Procedure: Randomly selecting clusters and including all individuals
from those clusters.
 Use Case: Practical when it's difficult to directly sample individuals, and
clusters represent distinct groups.


Fig-5

Non-Probability Sampling:

1. Convenience Sampling:
 Description: Involves selecting individuals who are easiest to reach or
obtain.
 Procedure: Choosing participants based on accessibility and
convenience.
 Use Case: Often used for quick and cost-effective studies but may
introduce bias.
2. Purposive Sampling:
 Description: Selects participants based on specific criteria or
characteristics relevant to the research.
 Procedure: Purposefully choosing individuals who meet defined
criteria.
 Use Case: When researchers want to study a particular subgroup with
distinct characteristics.
3. Snowball Sampling:
 Description: Starts with a few participants who, in turn, refer others to
participate.
 Procedure: Participants refer additional participants, creating a
"snowball" effect.
 Use Case: Useful when studying hard-to-reach populations or those
with shared characteristics.
4. Quota Sampling:
 Description: Involves selecting individuals based on pre-defined
quotas, ensuring representation from different categories.
 Procedure: Setting quotas for different subgroups and sampling
individuals to meet those quotas.


 Use Case: Useful when researchers want to ensure a diverse sample.

3. Literature Review

1. Acharyal, B., Bhattarai, G., de Gier, A., and Stein, A. (2000). Systematic
adaptive cluster sampling for the assessment of rare tree species in
Nepal. Forest Ecology and Management, 137, 65–73.
This study by Acharyal et al. focuses on the application of systematic adaptive
cluster sampling for assessing rare tree species in Nepal. The use of adaptive
cluster sampling suggests a methodical approach to data collection in
ecological studies. The paper may shed light on the challenges associated with
assessing rare species and how systematic adaptive cluster sampling
addresses these challenges.
2. Becker, E. F. (1991). A terrestrial furbearer estimator based on probability
sampling. Journal of Wildlife Management, 55, 730–737.
Becker's work in the Journal of Wildlife Management introduces a terrestrial
furbearer estimator based on probability sampling. The emphasis on
probability sampling suggests a rigorous and statistically sound approach to
estimating wildlife populations. This paper may contribute insights into the
development of estimation methods for terrestrial furbearers and highlight
the importance of probability-based sampling in wildlife management.
3. Bellhouse, D. R. (1988b). A brief history of random sampling methods. In
P. R. Krishnaiah and C. R. Rao (eds.), Handbook of Statistics, Vol. 6,
Sampling. Amsterdam: Elsevier Science Publishers, pp. 1–14.
Bellhouse's work provides a historical perspective on random sampling
methods, offering valuable insights into the evolution of sampling techniques.
This chapter within the "Handbook of Statistics" may serve as a foundational


resource for understanding the historical development and context of various


sampling methods. It could be particularly useful for researchers interested in
the broader context of sampling methodologies.

4. FINAL CODE AND OUTPUT

Simple Random Sampling


Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAX_SIZE 1000

int main() {
    char file_path[MAX_SIZE];
    int sample_size;

    srand(time(NULL)); /* seed the generator so samples differ between runs */

    printf("Enter the path of the file: ");
    fgets(file_path, sizeof(file_path), stdin);
    file_path[strcspn(file_path, "\n")] = '\0';

    printf("Enter the sample size: ");
    scanf("%d", &sample_size);

    FILE *file = fopen(file_path, "r");
    if (file == NULL) {
        printf("File not found.\n");
        return 1;
    }

    char content[MAX_SIZE];
    fgets(content, sizeof(content), file);
    content[strcspn(content, "\n")] = '\0';

    char *elements[MAX_SIZE];
    int num_elements = 0;

    /* Split the comma-separated line into individual elements */
    char *token = strtok(content, ",");
    while (token != NULL && num_elements < MAX_SIZE) {
        elements[num_elements] = malloc(strlen(token) + 1);
        strcpy(elements[num_elements], token);
        num_elements++;
        token = strtok(NULL, ",");
    }

    if (num_elements == 0) {
        printf("No elements found in file.\n");
        fclose(file);
        return 1;
    }

    if (sample_size > num_elements) {
        sample_size = num_elements;
    }

    /* Draw the sample: each pick is uniform over all elements */
    char *samples[sample_size];
    for (int i = 0; i < sample_size; i++) {
        int random_index = rand() % num_elements;
        samples[i] = malloc(strlen(elements[random_index]) + 1);
        strcpy(samples[i], elements[random_index]);
    }

    printf("Random samples: ");
    for (int i = 0; i < sample_size; i++) {
        printf("%s ", samples[i]);
    }
    printf("\n");

    for (int i = 0; i < num_elements; i++) {
        free(elements[i]);
    }
    for (int i = 0; i < sample_size; i++) {
        free(samples[i]);
    }
    fclose(file);
    return 0;
}

Output:

Systematic Sampling
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAX_SIZE 1000

char** systematic_sampling(char* file_path, int sample_size) {
    FILE* file;
    char* content;
    char** elements;
    char** samples;
    int total_elements;
    int interval;
    int start_index;
    int i;

    file = fopen(file_path, "r");
    if (file == NULL) {
        perror("Error opening file");
        return NULL;
    }

    content = (char*)malloc(MAX_SIZE * sizeof(char));
    if (content == NULL) {
        printf("Memory allocation failed.\n");
        fclose(file);
        return NULL;
    }

    /* Read the file and NUL-terminate the buffer (fread does not do this) */
    size_t bytes_read = fread(content, sizeof(char), MAX_SIZE - 1, file);
    content[bytes_read] = '\0';
    fclose(file);

    /* elements holds pointers, so allocate sizeof(char*) per slot */
    elements = (char**)malloc(MAX_SIZE * sizeof(char*));
    elements[0] = strtok(content, ",");
    total_elements = 1;
    while (total_elements < MAX_SIZE) {
        elements[total_elements] = strtok(NULL, ",");
        if (elements[total_elements] == NULL) {
            break;
        }
        total_elements++;
    }

    /* Trim surrounding whitespace from each token */
    for (i = 0; i < total_elements; i++) {
        char* trimmed = elements[i] + strspn(elements[i], " \t\n\r");
        trimmed[strcspn(trimmed, " \t\n\r")] = '\0';
        elements[i] = trimmed;
    }

    if (sample_size <= 0 || sample_size > total_elements) {
        printf("Invalid sample size.\n");
        free(elements);
        free(content);
        return NULL;
    }

    interval = total_elements / sample_size;  /* k = N / n */
    start_index = rand() % interval;          /* random start in [0, k) */

    samples = (char**)malloc(sample_size * sizeof(char*));
    if (samples == NULL) {
        printf("Memory allocation failed.\n");
        free(elements);
        free(content);
        return NULL;
    }

    /* Copy every k-th element so samples stay valid after content is freed */
    for (i = 0; i < sample_size; i++) {
        char* src = elements[start_index + i * interval];
        samples[i] = (char*)malloc(strlen(src) + 1);
        strcpy(samples[i], src);
    }

    printf("First randomly selected value: %s\n", samples[0]);
    printf("Calculated k value (sampling interval): %d\n", interval);

    printf("Systematic samples: ");
    for (i = 0; i < sample_size; i++) {
        printf("%s ", samples[i]);
    }
    printf("\n");

    free(elements);
    free(content);
    return samples;
}

int main() {
    char file_path[MAX_SIZE];
    int sample_size;
    char** result;

    srand(time(NULL));

    printf("Enter the path of the file: ");
    fgets(file_path, sizeof(file_path), stdin);
    file_path[strcspn(file_path, "\n")] = '\0';

    printf("Enter the sample size: ");
    scanf("%d", &sample_size);

    result = systematic_sampling(file_path, sample_size);

    /* Free the sample strings and the pointer array */
    if (result != NULL) {
        for (int i = 0; i < sample_size; i++) {
            free(result[i]);
        }
        free(result);
    }
    return 0;
}

Output:

Cluster Sampling
Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void cluster_sampling(float *data, int data_size, int num_clusters,
                      int cluster_size, float **clusters,
                      float ***selected_clusters) {
    /* Use only complete clusters so every selected cluster is fully populated */
    int num_data_clusters = data_size / cluster_size;
    *clusters = (float *)malloc(data_size * sizeof(float));
    *selected_clusters = (float **)malloc(num_clusters * sizeof(float *));

    /* Divide the data into clusters: cluster i occupies the flat range
       [i * cluster_size, (i + 1) * cluster_size) */
    for (int i = 0; i < num_data_clusters; i++) {
        int start = i * cluster_size;
        int end = start + cluster_size;
        for (int j = start; j < end; j++) {
            (*clusters)[j] = data[j];
        }
    }

    /* Randomly select a sample of clusters (partial Fisher-Yates shuffle
       of the cluster indices) */
    int *selected_indices = (int *)malloc(num_data_clusters * sizeof(int));
    for (int i = 0; i < num_data_clusters; i++) {
        selected_indices[i] = i;
    }
    for (int i = 0; i < num_clusters; i++) {
        int j = rand() % (num_data_clusters - i) + i;
        int temp = selected_indices[i];
        selected_indices[i] = selected_indices[j];
        selected_indices[j] = temp;
        (*selected_clusters)[i] = (float *)malloc(cluster_size * sizeof(float));
        for (int k = 0; k < cluster_size; k++) {
            (*selected_clusters)[i][k] =
                (*clusters)[selected_indices[i] * cluster_size + k];
        }
    }
    free(selected_indices);
}

int main() {
    srand(time(0)); // Seed for random number generation

    // User input for data size and number of clusters
    int data_size, num_clusters, cluster_size;
    printf("Enter the size of the data: ");
    scanf("%d", &data_size);
    float *data = (float *)malloc(data_size * sizeof(float));
    for (int i = 0; i < data_size; i++) {
        printf("Enter data element %d: ", i + 1);
        scanf("%f", &data[i]);
    }
    printf("Enter the number of clusters: ");
    scanf("%d", &num_clusters);

    // The number of clusters must be positive and no larger than the data size
    if (num_clusters <= 0 || num_clusters > data_size) {
        printf("Error: number of clusters must be between 1 and the data size.\n");
        free(data);
        return 1;
    }
    cluster_size = data_size / num_clusters;

    // Perform cluster sampling
    float *clusters;
    float **selected_clusters;
    cluster_sampling(data, data_size, num_clusters, cluster_size,
                     &clusters, &selected_clusters);

    // Display all clusters
    printf("\nData: ");
    for (int i = 0; i < data_size; i++) {
        printf("%f ", data[i]);
    }
    printf("\nNumber of Clusters: %d\n", num_clusters);
    printf("Cluster Size: %d\n", cluster_size);
    printf("\nAll Clusters:\n");
    for (int i = 0; i < data_size / cluster_size; i++) {
        printf("Cluster %d: ", i + 1);
        for (int j = 0; j < cluster_size; j++) {
            printf("%f ", clusters[i * cluster_size + j]);
        }
        printf("\n");
    }

    // Randomly select one cluster from the selected clusters
    int random_index = rand() % num_clusters;
    printf("\nRandomly Selected Cluster:\n");
    for (int i = 0; i < cluster_size; i++) {
        printf("%f ", selected_clusters[random_index][i]);
    }
    printf("\n");

    // Free allocated memory
    free(data);
    free(clusters);
    for (int i = 0; i < num_clusters; i++) {
        free(selected_clusters[i]);
    }
    free(selected_clusters);

    return 0;
}

Output:


Stratified Sampling
Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void generate_random_strata(float *data, int data_size, int num_strata,
                            float **strata) {
    int i, j;
    int stratum_size = data_size / num_strata; /* leftover elements are ignored */

    // Shuffle the data (Fisher-Yates)
    for (i = data_size - 1; i > 0; i--) {
        j = rand() % (i + 1);
        float temp = data[i];
        data[i] = data[j];
        data[j] = temp;
    }

    // Create strata from consecutive blocks of the shuffled data
    for (i = 0; i < num_strata; i++) {
        strata[i] = (float *)malloc(stratum_size * sizeof(float));
        for (j = 0; j < stratum_size; j++) {
            strata[i][j] = data[i * stratum_size + j];
        }
    }
}

void random_sampling_from_strata(float **strata, int num_strata,
                                 int stratum_size, int sample_size,
                                 float *sampled_data) {
    int i, j;
    int k = 0;

    // Combine all strata data into a single array
    float *all_data = (float *)malloc(num_strata * stratum_size * sizeof(float));
    for (i = 0; i < num_strata; i++) {
        for (j = 0; j < stratum_size; j++) {
            all_data[k++] = strata[i][j];
        }
    }

    // Randomly sample data (with replacement)
    for (i = 0; i < sample_size; i++) {
        int random_index = rand() % (num_strata * stratum_size);
        sampled_data[i] = all_data[random_index];
    }

    free(all_data);
}

int main() {
    srand(time(NULL));

    int data_size;
    printf("Enter the size of the data: ");
    scanf("%d", &data_size);

    float *data = (float *)malloc(data_size * sizeof(float));
    int i;
    for (i = 0; i < data_size; i++) {
        printf("Enter data element %d: ", i + 1);
        scanf("%f", &data[i]);
    }

    int num_strata;
    printf("Enter the number of strata (groups): ");
    scanf("%d", &num_strata);
    if (num_strata <= 0 || num_strata > data_size) {
        printf("Error: number of strata must be between 1 and the data size.\n");
        free(data);
        return 1;
    }

    float **strata = (float **)malloc(num_strata * sizeof(float *));
    generate_random_strata(data, data_size, num_strata, strata);

    int sample_size;
    printf("Enter the sample size: ");
    scanf("%d", &sample_size);

    float *sampled_data = (float *)malloc(sample_size * sizeof(float));
    random_sampling_from_strata(strata, num_strata, data_size / num_strata,
                                sample_size, sampled_data);

    printf("\nStrata:\n");
    for (i = 0; i < num_strata; i++) {
        printf("Stratum_%d: ", i + 1);
        for (int j = 0; j < data_size / num_strata; j++) {
            printf("%.2f ", strata[i][j]);
        }
        printf("\n");
    }

    printf("\nRandomly Sampled Data:\n");
    for (i = 0; i < sample_size; i++) {
        printf("%.2f ", sampled_data[i]);
    }
    printf("\n");

    // Free memory
    for (i = 0; i < num_strata; i++) {
        free(strata[i]);
    }
    free(strata);
    free(data);
    free(sampled_data);

    return 0;
}

Output:


5. FUTURE SCOPE
The future scope of data sampling is likely to be influenced by advancements in technology,
changes in data availability, and evolving data science methodologies. Here are some
potential trends and areas of growth in the future of data sampling:

1. Big Data and Streaming Data:


 As the volume of data continues to grow exponentially, there will be a
need for efficient sampling techniques for big data and streaming data
applications. Traditional sampling methods may need to be adapted or
new methods developed to handle large and continuously streaming
datasets.
2. Machine Learning and AI:
 The integration of data sampling techniques with machine learning
algorithms is likely to increase. Researchers and practitioners may
explore innovative ways to incorporate sampling into training
processes, making machine learning models more scalable and
efficient.
3. Real-time Sampling:
 The demand for real-time insights is increasing across various
industries. Future data sampling methods may need to focus on
providing real-time sampling solutions to support timely decision-
making in dynamic environments.
4. Adaptive Sampling:
 Adaptive sampling methods, which dynamically adjust the sampling
strategy based on the characteristics of the data, may become more
prevalent. This can lead to more accurate and efficient representations


of datasets, especially in situations where data distribution changes


over time.

5. Privacy-Preserving Sampling:
 With growing concerns about data privacy, there may be an increased
focus on developing sampling methods that can generate
representative samples without compromising individual privacy.
Differential privacy and other privacy-preserving techniques could play
a significant role in this context.
6. Domain-Specific Sampling Techniques:
 Different domains may have unique characteristics that require
specialized sampling techniques. Future developments may include the
creation of domain-specific sampling methods tailored to the needs of
specific industries such as healthcare, finance, and social sciences.

7. Blockchain and Sampling Audits:

 Blockchain technology may be employed to create transparent and


verifiable records of sampling processes, enabling better auditing and
ensuring the integrity of sampled data. This could be particularly
important in industries where data accuracy and accountability are
critical.
8. Integration with Data Governance:
 Data governance practices will likely become more integrated with data
sampling methodologies. This includes establishing and enforcing
policies related to data sampling, ensuring compliance with regulations,
and maintaining data quality standards.
9. Automated and Autonomous Sampling:


 Advances in automation and artificial intelligence may lead to the


development of autonomous sampling systems that can intelligently
adapt and adjust their sampling strategies based on evolving data
patterns and business requirements.

6. CONCLUSION
In conclusion, the project on data sampling has explored various aspects of sampling
methodologies, techniques, and their applications in the realm of data science. The
key findings and takeaways from the project are summarized below:

1. Significance of Data Sampling:


 The project emphasized the critical role that data sampling plays in
managing and analyzing large datasets. Sampling provides a cost-
effective and efficient way to draw meaningful insights from extensive
data collections.
2. Sampling Techniques and Methods:
 The project delved into a variety of sampling techniques, including
random sampling, stratified sampling, systematic sampling, and more.
Each technique was evaluated in terms of its applicability to different
scenarios and its impact on the accuracy of results.
3. Challenges and Considerations:
 Challenges associated with data sampling, such as bias,
representativeness, and the impact on statistical inference, were
thoroughly examined. Strategies to address these challenges, including
the use of adaptive sampling and advanced statistical methods, were
explored.
4. Applications Across Industries:


 The project highlighted the diverse applications of data sampling


across various industries, from healthcare and finance to marketing and
social sciences. Case studies were presented to illustrate how sampling
can be instrumental in making informed decisions in different domains.
5. Technological Advances:
 The project acknowledged the influence of technological
advancements on the future of data sampling. Big data, machine
learning, and real-time processing were identified as areas where
sampling methodologies are evolving to meet the demands of modern
data analysis.
6. Privacy and Ethics:
 The ethical considerations of data sampling, especially concerning
privacy, were addressed. The project explored emerging techniques like
differential privacy and discussed the importance of adhering to ethical
standards in the collection and use of sampled data.
7. Future Trends:
 The project anticipated future trends in data sampling, including its
integration with machine learning, real-time applications, and the
development of domain-specific sampling methods. The role of data
governance, blockchain, and open-source initiatives in shaping the
future of sampling practices was also highlighted.


7. WORKLOAD DISTRIBUTION

 Sai Akhilesh Y- Simple Random Sampling and Systematic Sampling


 Santosh K- Cluster Sampling
 Srikar C- Stratified Sampling

