
Create a PySpark (or any Python-based distributed) framework that applies rules over incoming Kafka stream data.

Apply the rules defined in the RULE table and, whenever a rule break occurs, create an entry in any persistent storage under the logical table RULE_BREAK.

We have a single data source producing signal data every 1 minute as a Kafka stream, and the 3 (or n) tags below with their data types.

Table : TAG : definition of a sensor tag

--------------------
tag_id -> tag id for a signal
tag_name -> tag name for a signal
data_type -> data type of the tag; restricted to string, int, double

tag_id | tag_name | data_type
1 | t1 | double
2 | t2 | int
3 | t3 | string
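
Since the TAG table is what fixes each tag's data type, the payload schema can be derived from it instead of being hard-coded; this also covers the "n number of tags" case. A minimal PySpark sketch (the helper name and the three-row example call are illustrative, not part of the spec):

from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, DoubleType, LongType)

# Map TAG.data_type strings to Spark types (restricted to string, int, double).
SPARK_TYPES = {"string": StringType(), "int": IntegerType(), "double": DoubleType()}

def build_value_schema(tags):
    # tags: (tag_name, data_type) rows read from the TAG table.
    fields = [StructField("timestamp", LongType())]
    fields += [StructField(name, SPARK_TYPES[dtype]) for name, dtype in tags]
    return StructType(fields)

# The three tags defined above.
value_schema = build_value_schema([("t1", "double"), ("t2", "int"), ("t3", "string")])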

Table : RULE : definition of a rule break expression

--------------------
rule_id -> id of the rule
rule_name -> name of the rule
rule_expression -> expression for the rule to be applied; the only operators we need to support are >, =, <, and !=, as per the data types
rule_break_count -> number of consecutive rule breaks required
rule_description -> human-readable description of the rule

# Note – each rule break definition uses exactly one tag, e.g. "t1 > 4" occurring 4 consecutive times

rule_id | rule_name | rule_expression | rule_break_count | rule_description
1 | t1_double_rule | t1 > 55.43 | 4 | if t1 > 55.43 for 4 consecutive times, it's a rule break
2 | t2_int_rule | t2 > 20 | 6 | if t2 > 20 for 6 consecutive times, it's a rule break
3 | t3_string_rule | t3 = ON | 3 | if t3 = ON for 3 consecutive times, it's a rule break
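
Because every rule_expression is "tag operator literal" over a single tag, it can be compiled once when the RULE table is loaded and then evaluated per record. A plain-Python sketch, assuming whitespace-separated expressions exactly as in the table above (compile_rule is an illustrative helper, not part of the spec):

import operator

# Only the four operators the assignment requires.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq, "!=": operator.ne}

def compile_rule(rule_expression, data_type):
    # e.g. ("t1 > 55.43", "double") -> ("t1", predicate over one tag value)
    tag_name, op_symbol, literal = rule_expression.split()
    cast = {"string": str, "int": int, "double": float}[data_type]
    op, threshold = OPS[op_symbol], cast(literal)
    return tag_name, lambda value: op(cast(value), threshold)

# Rule 1 from the table above.
tag, pred = compile_rule("t1 > 55.43", "double")
assert pred(63.23) and not pred(20.23)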

Table : Output : the RULE_BREAK schema should contain the fields below

---------------------------
rule_id -> id of the rule that broke
rule_break_stop_timestamp -> the timestamp at which the rule break count satisfied the criteria; we need to capture the stop time when the rule break condition is satisfied. While streaming, there can be multiple rule breaks for each definition, depending on its criteria.

Example :
A Kafka streaming data sample would look like the records below (shown here as valid, flattened JSON); you can generate your own data for testing:

{ "timestamp": 1571053218000, "t1": 55.23, "t2": 10, "t3": "ON" }
{ "timestamp": 1571053278000, "t1": 63.23, "t2": 11, "t3": "OFF" }
{ "timestamp": 1571053338000, "t1": 73.23, "t2": 12, "t3": "ON" }
{ "timestamp": 1571053398000, "t1": 83.23, "t2": 13, "t3": "ON" }
{ "timestamp": 1571053458000, "t1": 20.23, "t2": 14, "t3": "ON" }
{ "timestamp": 1571053518000, "t1": 30.23, "t2": 25, "t3": "OFF" }
and so on...
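
A possible ingestion sketch with Spark Structured Streaming, assuming the records are flattened to valid JSON as shown above. The broker address and topic name are placeholders, and the job needs the spark-sql-kafka connector package on its classpath:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (StructType, StructField, LongType,
                               DoubleType, IntegerType, StringType)

spark = SparkSession.builder.appName("rule-break-detector").getOrCreate()

# Schema matching the sample records above (or built from the TAG table).
value_schema = StructType([
    StructField("timestamp", LongType()),
    StructField("t1", DoubleType()),
    StructField("t2", IntegerType()),
    StructField("t3", StringType()),
])

# Kafka source; bootstrap servers and topic name are illustrative placeholders.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "sensor-signals")
       .load())

# Each Kafka record's value is a JSON string; parse it into typed columns.
events = (raw.select(from_json(col("value").cast("string"), value_schema).alias("e"))
          .select("e.*"))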

The result would be:

A rule break happened for rules 1 and 3 only, so the output contains:

rule_id | rule_break_stop_timestamp
1 | 1571053398000
3 | 1571053458000

Note:
We create an entry in the RULE_BREAK table only if the condition is satisfied for n consecutive streaming records.
If the condition is not satisfied by the current record, we reset the count and apply the pattern/rule break conditions again.
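
This reset rule amounts to a small per-rule state machine: count consecutive hits, emit the current record's timestamp once the count reaches rule_break_count, and zero the counter whenever the expression fails. A plain-Python sketch of that state; in the streaming job it would live in Spark's stateful APIs (e.g. applyInPandasWithState in Spark 3.4+) or in an external store keyed by rule_id:

class ConsecutiveBreakTracker:
    # Tracks consecutive hits for one rule and returns a RULE_BREAK row
    # (rule_id, rule_break_stop_timestamp) each time the criteria is met.

    def __init__(self, rule_id, predicate, rule_break_count):
        self.rule_id = rule_id
        self.predicate = predicate
        self.rule_break_count = rule_break_count
        self.streak = 0

    def update(self, value, timestamp):
        if self.predicate(value):
            self.streak += 1
            if self.streak == self.rule_break_count:
                self.streak = 0  # completed streak: start counting afresh
                return (self.rule_id, timestamp)
        else:
            self.streak = 0  # condition failed in the current record: reset
        return None

Replaying the sample records for rule 3 (t3 = ON, count 3) yields (3, 1571053458000), matching the expected output above. Note that after a completed streak this sketch restarts the count from zero rather than emitting on every further consecutive hit; that behaviour is an assumption worth confirming against the spec.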

▪ Briefly describe the conceptual approach you chose! What are the trade-offs?
▪ What's the runtime performance? What is the complexity? Where are the bottlenecks?
▪ If you had more time, what improvements would you make, and in what order of priority?
