Week7 Bda

Week-7

Split lines from the textbook into arrays of words and perform
operations using select().
Aim: To write a program that splits lines from the textbook into arrays of
words and performs operations using select().
Description:
The split() function in pyspark.sql.functions splits a string column into an array
of substrings and returns the result as a new column. It does not change the
original column. The separator is interpreted as a regular expression. If a single
space (" ") is used as the separator, the string is split between words.
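
The splitting behaviour described above can be tried on a single line without Spark. The following is a minimal plain-Python sketch, not the Spark program itself: it uses re.split because Spark's split() also treats the separator as a regular expression, and the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample line; note the double space between "quick" and "brown".
line = "The quick  brown fox"

# Split on a single space, the same separator the program passes to split().
words = re.split(" ", line)

# Split on one or more whitespace characters to avoid empty strings.
clean_words = re.split(r"\s+", line)

print(words)        # ['The', 'quick', '', 'brown', 'fox']
print(clean_words)  # ['The', 'quick', 'brown', 'fox']
print(line)         # the original string is unchanged
```

With a single-space separator, consecutive spaces produce empty strings in the array, which also inflates any later word count. If the text contains irregular spacing, a pattern such as "\\s+" may be a better choice of separator.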
Dataset:

Program:
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.functions import split

# Create (or reuse) a Spark session.
spark = SparkSession.builder.getOrCreate()

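# Read the text file; each line becomes one row in the "value" column.
book = spark.read.text("Text.txt")
book.show()
# Split each line on single spaces into an array column named "words".
lines = book.withColumn("words", split(book["value"], " "))
lines.show()
# Add a 1-based row ID. The generated IDs are unique and increasing, but they
# are consecutive only when the data sits in a single partition (a small file).
linesID = lines.withColumn("ID", monotonically_increasing_id() + 1)
linesID.show()
#lines.select("words").head(5)
# Show the first five lines.
linesID.select("value", "ID").filter(linesID.ID <= 5).show()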
# Show the five lines around the middle of the file.
middleRowID = linesID.count() // 2
linesID.select("value", "ID").filter(
    (linesID.ID >= middleRowID - 2) & (linesID.ID <= middleRowID + 2)
).show()

# Show the odd-numbered lines.
linesID.select("value", "ID").filter(linesID.ID % 2 != 0).show()

from pyspark.sql.functions import size

# Count the words in each line, then keep only the lines with exactly 19 words.
word_count_df = lines.select("words", size("words").alias("word_count"))
word_count_df.show()
filtered_df = word_count_df.filter(word_count_df.word_count == 19)
filtered_df.show(truncate=False)

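from pyspark.sql.functions import expr

# Keep only the three-character words of each line, using Spark SQL's
# higher-order filter() function on the "words" array.
filtered_lines = lines.withColumn(
    "3CharacterWords", expr("filter(words, word -> length(word) = 3)")
)
filtered_lines.select("3CharacterWords").show()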
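
To check the Spark output by hand, the same selections can be reproduced on a few lines in plain Python. This is only a sketch under stated assumptions: the six sample lines are invented (they are not the contents of Text.txt), and the 1-based IDs mimic what monotonically_increasing_id() + 1 yields when all rows are in one partition.

```python
# Hypothetical sample data standing in for the lines of the text file.
sample_lines = [
    "the cat sat on the mat",
    "a dog ran",
    "birds fly high in the sky",
    "fish swim",
    "the end",
    "one more line for you",
]

# (ID, line, words) triples with 1-based IDs.
rows = [(i, text, text.split(" ")) for i, text in enumerate(sample_lines, start=1)]

# First five lines.
first_five_ids = [i for i, _, _ in rows if i <= 5]

# Five lines around the middle row.
middle = len(rows) // 2
middle_ids = [i for i, _, _ in rows if middle - 2 <= i <= middle + 2]

# Odd-numbered lines.
odd_ids = [i for i, _, _ in rows if i % 2 != 0]

# Number of words per line, like size("words").
word_counts = [len(words) for _, _, words in rows]

# Three-character words per line, like filter(words, word -> length(word) = 3).
three_char_words = [[w for w in words if len(w) == 3] for _, _, words in rows]

print(first_five_ids)
print(middle_ids)
print(odd_ids)
print(word_counts)
print(three_char_words)
```

Running the Spark program on a file with these six lines should give matching rows for each step, which is a simple way to confirm that each filter does what is intended.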
