
Parallelisation Project

Digital Music Analysis C#

CAB401: High Performance and Parallel Computing

Aden Harris n10235728


Date: 23/10/2020

Contents

Digital music analysis
High-Level Description
    freqDomain
    onsetDetection
First Analysis for potential parallelism
    freqDomain
    Timefreq
Analysis of ‘fft’ function
Transform fft from recursive to iterative
Binary Search Tree
Iterative FFT & Parallelisation Attempt
Implemented Parallelism
Results
    Profiler Results
    Speedup Graph
    Maintaining Data Consistency
Resources used during Project
Parallelism abstractions used in Project
Considerations and Project Hurdles
Reflection
References
Digital music analysis
Digital Music Analysis is an open-source C# application that aims to provide beginner violin
students with a practising tool that gives feedback on their playing. The software is ideally
supposed to act as a substitute for having a violin teacher present.

At a high level, the program delivers its advice by first analysing a piece of violin audio provided
by the user. Each note in that audio file is then compared in pitch against the corresponding
note of the original piece, which is provided to the program in an XML file.

More specifically, the Digital Music Analysis program will generate a new score containing both
the original composition and the notes detected from the user's wav file.

The user can then navigate through each note as the program shows the pitch expected to
match the original piece and the margin of error, commenting on whether the user was too flat
or too sharp.

While the program at this stage is operational, it suffers from significant loading delays after file
selection. It also has a fairly primitive GUI that could hinder the user's ability to discover all of
the app's features without prior explanation.

One of its rivals is Intunator, which is also a pitch-corrective tool for practising instruments [1].
The notable differences are that Intunator covers both wind and string instruments and provides
note feedback instantaneously after the user plays a note.

My initial intention in adding parallelism to this program is to focus on reducing the start-up time
of the app after the user selects their files.
High-Level Description
Once the program commences, the ‘MainWindow’ function is called, which prompts the user
to select a digital audio (wav) file to analyse and a sheet music (xml) file for the program to read.

After that, a new thread is created to concurrently update a duration slider as the music is
being played.

freqDomain
The audio file is then loaded in and, if successful, a new wavefile is formed. This wavefile is
then analysed within the ‘freqDomain()’ function. The purpose of this function is to transform the
digital audio data of the .wav file into a time-frequency format that is then visualised using a new
array, ‘pixelArray’.

The function begins by creating a timefreq object, ‘stftRep’, which accepts our audio data
contained in the new wavefile and a sample rate of 2048.

The remainder of freqDomain uses the time-frequency data of the song to form a visualisation of
the raw data in the audio file, viewable in the Frequency tab of the program for diagnostic
purposes.

onsetDetection
The sheet music XML is then loaded in, and we have access to the frequencies at each
sample from the Complex array x. These frequencies are assessed by the ‘onsetDetection’
function to determine the beginning and end of each note.

Knowing the start and finish of each note, ‘onsetDetection’ performs its own Fast Fourier
transform, over a longer period than the 2048-sample intervals used in timefreq.
After the transform has finished, all of the frequencies of each note are identified and
compared across all of the octaves to find the highest-intensity frequency in each.
This frequency is then used to determine the note being played.

‘onsetDetection’ also handles the visual components of the staff tab, which consist of the notes
on the stave coloured according to their accuracy, as well as displaying each note's individual
data. This information is triggered by a mouseover event for each note and shows whether the
note's pitch was correct or, if not, the margin of error. Lastly, onsetDetection handles the timing
rectangles below the stave that assess each note's duration.
First Analysis for potential parallelism
Having assessed how the Digital Music Analysis program works, the next step is to identify any
performance bottlenecks in the code that could be addressed using parallelism.

The first approach I took was to take advantage of the CPU profiler built into the IDE I'm using
(Visual Studio). This tool allowed me both to accurately view the distribution of the program's
CPU usage and to find the sections of code that were particularly taxing on the system.

This test highlighted two prominent areas that were impacting the program's overall
performance: the timefreq class within the freqDomain function, and the onsetDetection
function.

Between them they make up a total of 56% of the program's CPU time. Now that these have
been identified as candidates for parallelism, I will review each in depth to locate their hot spots
and determine their suitability for parallelisation.

freqDomain
Using the inbuilt Visual Studio profiler, we can see that the vast majority of the computation
taking place in timefreq is situated at its first line (266). This assignment takes on average 1600
milliseconds to run, which makes up 96% of its total time.
Clearly timefreq is in more need of a speedup, so I'm going to focus on this area for
parallelism.
Timefreq
In a broad sense, timefreq takes the audio file we read in previously, as an array, along with a
sample rate. It then performs a Fast Fourier transform by creating an array of complex numbers.
Once complete, it gathers a two-dimensional array of the audio file's times and frequencies
(timeFreqData) by using the new complex array to run a short-time Fast Fourier transform at
line 51. This information is later used to identify which notes are being played at any point in
time, based on the highest-intensity frequencies measured.

This takes place within the ‘stft’ function, which begins by setting up for the later ‘fft’ function
that will actually conduct the transform. The setup involves dividing the song into intervals
based on the sample rate, then running ‘fft’ over each interval.
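The setup described above can be sketched roughly as follows. This is a simplified sketch, not the program's actual code: the names Stft, Fft and wSamp, and the identity placeholder for the transform, are my assumptions based on the description.

```csharp
using System;
using System.Numerics;

public static class StftSketch
{
    // Sketch of the stft setup: divide the song into windows of wSamp
    // samples, then run fft over each window.
    public static Complex[][] Stft(Complex[] x, int wSamp)
    {
        int cols = x.Length / wSamp;            // number of whole intervals
        var result = new Complex[cols][];
        for (int col = 0; col < cols; col++)
        {
            var window = new Complex[wSamp];
            Array.Copy(x, col * wSamp, window, 0, wSamp);
            result[col] = Fft(window);          // transform one interval
        }
        return result;
    }

    // Placeholder for the real fft; the identity here so the sketch runs.
    public static Complex[] Fft(Complex[] x) => x;
}
```

Each column of the returned array corresponds to one time slice, which is how the two-dimensional time-frequency data described above is built up.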

Analysis of ‘fft’ function


After locating regions of the program that take up large proportions of its execution time, I found
that not only is the ‘fft’ function called in ‘timefreq’ taking significant time to compute, there is
also a mirror duplicate used in ‘onsetDetection’.
Recursive calls within serial ‘fft’, lines 142 and 143

The first roadblock to parallelising this function is its pre-existing output dependencies. These
are a result of this recursive version of the Fast Fourier transform, which overwrites the values
of the E and O arrays as it splits into more subproblems. In its current state, with this
dependency, parallelising it is not recommended, as there is a high likelihood that the original
results won't be reproduced.

The next thing to do is consider how I could alter or replace the algorithm, or the entire function,
to better enable parallelism. But to change it while still preserving its original purpose, we need
to understand how it works.

This Fast Fourier transform uses the Cooley-Tukey algorithm, a divide-and-conquer
algorithm that is recursive and uses twiddle factors.
Our function starts by dividing the transform into an even and an odd array and then repeating
the process on the new arrays. Each layer a split occurs, the length of each array halves while
the number of arrays doubles: beginning with one array of length 2048 and eventually reaching
1024 arrays of length 2. At this point, the arrays begin to recombine pairwise using an equation
involving twiddle factors to arrive at a final array of frequencies to be returned.
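The recursive split-and-recombine structure described above can be illustrated with a standard radix-2 Cooley-Tukey sketch. This is not the program's actual ‘fft’; it is a textbook version written to show where the even/odd (E and O) split and the twiddle-factor recombination sit.

```csharp
using System;
using System.Numerics;

public static class RecursiveFft
{
    // Standard recursive radix-2 Cooley-Tukey FFT; x.Length must be a
    // power of two.
    public static Complex[] Fft(Complex[] x)
    {
        int n = x.Length;
        if (n == 1) return new[] { x[0] };     // base case: nothing to split

        var even = new Complex[n / 2];         // the 'E' array
        var odd  = new Complex[n / 2];         // the 'O' array
        for (int i = 0; i < n / 2; i++)
        {
            even[i] = x[2 * i];
            odd[i]  = x[2 * i + 1];
        }

        var e = Fft(even);   // recursive calls: these are the splits whose
        var o = Fft(odd);    // results get overwritten in the serial version

        var result = new Complex[n];
        for (int k = 0; k < n / 2; k++)
        {
            // Twiddle factor e^(-2*pi*i*k/n) recombines the halves pairwise.
            Complex t = Complex.FromPolarCoordinates(1.0, -2.0 * Math.PI * k / n) * o[k];
            result[k]         = e[k] + t;
            result[k + n / 2] = e[k] - t;
        }
        return result;
    }
}
```

For example, transforming a constant signal [1, 1, 1, 1] concentrates all energy in the first bin, giving [4, 0, 0, 0].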
Transform fft from recursive to iterative
In my initial attempt I tried separating the FFT function into a dividing stage followed by
recombining at the end. This led me to try to create a helper function that would handle all of
the divisions iteratively and return the final list of arrays to be re-joined in the second half of
FFT.

Recursive version (left) and iterative version (right)

Ideally, I would have the helper function take the original array of complexes and return its
elements grouped in pairs, ready to begin calculating the final returned array.

Given this array for example: Complex[] x = [A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P].

The helper function would return the elements paired in the following way:
(A,I) (E,M) (C,K) (G,O) (B,J) (F,N) (D,L) (H,P)

Then the second half of FFT will apply the combining function f to them pairwise, giving the
following arrays:
A1=f(A,I) B1=f(E,M) C1=f(C,K) D1=f(G,O) E1=f(B,J) F1=f(F,N) G1=f(D,L) H1=f(H,P)

Next round, the function is applied to A1 B1, C1 D1, E1 F1 and G1 H1, yielding the
following arrays:
A2=f(A1,B1) B2=f(C1,D1) C2=f(E1,F1) D2=f(G1,H1)

Then we apply the function to A2 B2 and C2 D2, yielding the following arrays:

A3=f(A2,B2) B3=f(C2,D2)

After exhausting the smaller arrays, the final result works out to be:
result = f(A3,B3).
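The pairing order described above — A, I, E, M, C, K, G, O, B, J, F, N, D, L, H, P for a 16-element input — is, as far as I can tell, the bit-reversal permutation that standard iterative FFTs use as their starting arrangement. A small sketch computing that ordering (my own illustration, not taken from the program):

```csharp
using System;

public static class BitReversal
{
    // Returns, for each position j, the source index whose bits are the
    // reverse of j. For n = 16 this yields 0,8,4,12,2,10,6,14,1,9,5,13,
    // 3,11,7,15 - i.e. A,I,E,M,C,K,G,O,B,J,F,N,D,L,H,P for the example
    // array [A..P] above.
    public static int[] Order(int n)
    {
        int bits = 0;
        while ((1 << bits) < n) bits++;        // log2(n) for a power of two
        var order = new int[n];
        for (int j = 0; j < n; j++)
        {
            int rev = 0;
            for (int b = 0; b < bits; b++)     // mirror the bits of j
                if ((j & (1 << b)) != 0) rev |= 1 << (bits - 1 - b);
            order[j] = rev;
        }
        return order;
    }
}
```

Recognising the pattern as bit reversal is what lets iterative FFTs skip the dividing stage entirely: the input is permuted once, then recombined level by level exactly as in the hand-worked rounds above.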

Below was the closest I got to replicating my idea using this method. I discontinued this
approach after realising that figuring out a suitable way to ensure the correct values were being
updated and used in order was far too challenging.
Binary Search Tree
My second attempt at converting the FFT function was to replace my previous helper function
and instead implement a binary search tree to record each division of the arrays. My
justification for trialling this was that I previously needed a way to track the data of the parent
array when creating a new subproblem, or else the ‘E’ and ‘O’ arrays would be overwritten out
of sequence. A tree appeared to have potential, as each array could be stored as a node with
its respective even and odd arrays becoming its left and right children.
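The node structure for that idea might look something like the following. This is a hypothetical reconstruction of the approach described, not the code actually trialled; the type name SplitNode and the length-2 stopping point are my assumptions.

```csharp
using System.Numerics;

// Hypothetical node type for the tree-of-splits idea: each node holds one
// array, with its even and odd halves stored as its two children.
public class SplitNode
{
    public Complex[] Data;
    public SplitNode Even;   // left child: even-indexed elements
    public SplitNode Odd;    // right child: odd-indexed elements

    public SplitNode(Complex[] data)
    {
        Data = data;
        if (data.Length > 2)             // stop splitting at length 2
        {
            var e = new Complex[data.Length / 2];
            var o = new Complex[data.Length / 2];
            for (int i = 0; i < data.Length / 2; i++)
            {
                e[i] = data[2 * i];
                o[i] = data[2 * i + 1];
            }
            Even = new SplitNode(e);
            Odd  = new SplitNode(o);
        }
    }
}
```

Because every node keeps its own copy of the split data, nothing is overwritten out of sequence, which is the property the ‘E’ and ‘O’ arrays in the recursive version lacked.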

Surprisingly, the tree created the correct number of nodes and lengths; however, the original
data wasn't replicated correctly. The further limitations of this approach became apparent when
I started to trial parallelising the code involved with the assignment of new nodes. Determining
how to return the smallest series of nodes (arrays of length 2) in a usable way also proved
deceptively difficult. In retrospect, my implementation clearly mustn't have been thread-safe, as
any parallelisation attempts ended up distorting the results further.
Iterative FFT & Parallelisation Attempt
Finally, I adapted an iterative version of the Fast Fourier transform that I found online [2].
After confirming that the new version still produced the same output, I identified the nested for
loops beginning at line 222 as the most impactful site for parallelisation.
I first opted to use the Parallel.For method from the Task Parallel Library. I found that when
used on the inner loops, which run a large number of small iterations, it caused a noticeable
amount of overhead that eclipsed the serial runtime. This is most likely the result of having to
invoke the delegate on every iteration, alongside excessive thread synchronisation. To offset
this, I focused on parallelising the outermost loop to limit the creation of threads for trivial
computation.
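The trade-off described above can be illustrated with a generic loop nest. This is not the FFT code itself, just a sketch of the pattern: one Parallel.For over the outer, large-work loop, while the small inner iterations stay serial, so the delegate-invocation cost is paid once per outer iteration rather than once per inner one.

```csharp
using System.Threading.Tasks;

public static class OuterLoopParallel
{
    // Sums each block of input data in parallel across blocks, serially
    // within a block. Each iteration i writes only results[i], so there
    // is no race between outer iterations.
    public static double[] Process(double[][] blocks)
    {
        var results = new double[blocks.Length];
        Parallel.For(0, blocks.Length, i =>     // outer loop: parallel
        {
            double sum = 0;
            for (int j = 0; j < blocks[i].Length; j++)   // inner loop: serial
                sum += blocks[i][j];
            results[i] = sum;
        });
        return results;
    }
}
```

Had the Parallel.For been placed on the inner loop instead, the delegate would be invoked for every trivial addition, which matches the overhead behaviour observed above.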

The next challenge was the data dependencies: stepping through the for loops, I assessed that
the values of evenIndex, oddIndex and term were all subject to change based on the order of
execution. This often led to index-out-of-bounds exceptions being thrown when testing varying
styles of Parallel.For implementation.

To try to address this issue, I investigated how locks might be used to prevent the parallelism
from causing more erroneous behaviour.
I wrapped the critical sections with locks so that while the operations in the inner loop are being
executed by one thread, they can't be entered simultaneously by another.
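The lock-wrapping pattern described might look roughly like this. It is a simplified sketch with a hypothetical shared accumulator, not the actual fft code; it shows only the mechanism, along with its cost: the lock serialises the protected work, which is part of why the locked version offered no speedup.

```csharp
using System.Threading.Tasks;

public static class LockSketch
{
    // A shared total updated from parallel iterations. The lock makes the
    // read-modify-write a critical section, so two threads can't interleave
    // it and lose updates.
    public static long SumWithLock(int[] values)
    {
        long total = 0;
        object gate = new object();
        Parallel.For(0, values.Length, i =>
        {
            lock (gate)            // only one thread at a time in here
            {
                total += values[i];
            }
        });
        return total;
    }
}
```

The result is always correct, but because every iteration queues up at the same lock, the loop effectively runs serially plus synchronisation overhead, mirroring the outcome reported below.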

Ultimately, I was unable to devise a method of parallelising the Fast Fourier transform that
both reproduced the original results exactly and offered a speed improvement over the
sequential program.

Attempted parallelisation of the outer for loop (left) and CPU usage afterwards (right)


Implemented Parallelism
Even though I wasn't able to get my identified hotspots to run in parallel, I still managed to find a
couple of other sections of code to speed up using my chosen parallelisation framework.

These areas proved more feasible to parallelise: having no data dependencies between
iterations, they exhibited inherent data parallelism. Because these sections were already
embarrassingly parallel, I didn't have to restructure anything and could simply convert the
existing for loop using a Parallel.For.

The main one is located on line 364 of the MainWindow file, within the onsetDetection function.
The resulting change caused this loop to drop from 3.74% to 0.83% of the program's total CPU
usage, an average of around 3.5 times faster.
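Converting such a loop is mechanical. The real loop at line 364 operates on note data; the sketch below uses a placeholder body, but the before/after shape of the change is the same: each iteration touches only its own index, so the Parallel.For is a drop-in replacement.

```csharp
using System;
using System.Threading.Tasks;

public static class LoopConversion
{
    // Serial original: each iteration depends only on its own index.
    public static void SerialFill(double[] output, double[] input)
    {
        for (int i = 0; i < output.Length; i++)
            output[i] = Math.Sqrt(input[i]);
    }

    // Parallel version: safe because iteration i touches only input[i]
    // and output[i], so no two iterations share state.
    public static void ParallelFill(double[] output, double[] input)
    {
        Parallel.For(0, output.Length, i =>
        {
            output[i] = Math.Sqrt(input[i]);
        });
    }
}
```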

Results

Profiler Results

Sequential

Parallel

onsetDetection stays around 200 milliseconds faster in the parallel version


Speedup Graph

Maintaining Data Consistency


While adding and reworking code during this project, I paid attention to ensuring that the
changes I made didn't accidentally alter the results of the program. To detect obvious changes,
I periodically compared the visual components of the program, which was often enough to
notice glaring alterations. I also had to come up with a more precise approach for when data
changes could be more subtle.

In that event, I would write the before and after data to files and run them through a comparator
function. This detects changes faster and more accurately than manual comparison and also
returns all instances of data inconsistency.
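A comparator of the kind described might be sketched as follows. The file format (one numeric value per line) and the tolerance parameter are my assumptions; the actual comparer function is shown in the figure below.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class OutputComparer
{
    // Compares two dumps of numeric results line by line and returns every
    // index where they differ by more than a small tolerance, so all
    // inconsistencies are reported rather than just the first.
    public static List<int> FindDifferences(string beforePath, string afterPath,
                                            double tolerance = 1e-9)
    {
        var before = File.ReadAllLines(beforePath);
        var after  = File.ReadAllLines(afterPath);
        var diffs  = new List<int>();
        int n = Math.Min(before.Length, after.Length);
        for (int i = 0; i < n; i++)
        {
            double a = double.Parse(before[i], CultureInfo.InvariantCulture);
            double b = double.Parse(after[i], CultureInfo.InvariantCulture);
            if (Math.Abs(a - b) > tolerance)
                diffs.Add(i);                  // record every inconsistency
        }
        return diffs;
    }
}
```

A small tolerance matters here because reordering floating-point additions, as parallel reductions do, can legitimately perturb the last few bits of otherwise-identical results.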

Comparer function detecting data inconsistencies between versions


Resources used during Project
Over the course of this project I've benefited from numerous tools that simplified the process of
parallelising the program. For starters, since the Digital Music Analysis program is written in C#,
I performed all development in Visual Studio 2019, which uses the .NET Compiler Platform
(Roslyn) as its C# compiler. I opted for Visual Studio 2019 as it's both a platform I'm familiar
with and one with its own inbuilt CPU profiler. This proved invaluable for assessing where in the
program it was best to focus the parallelisation effort and for finding its critical ‘hot spots’.

All testing was performed on the same computer, running an AMD CPU with 6 physical cores
and 12 logical cores at 3.6 GHz. The PC also has an M.2 storage drive and 16 GB of DDR4
memory.

For the parallelism itself, I primarily incorporated the Task Parallel Library and the
System.Threading.Tasks namespace, which was introduced as part of .NET Framework 4.
Specifically, the Parallel.For method was my first option for parallelising for loops.

Parallelism abstractions used in Project


I elected to use the Task Parallel Library, which is the preferred way to parallelise code using
.NET. It takes advantage of implicit threads: higher-level constructs that streamline creating and
working with threads.
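The difference between the two styles can be illustrated side by side. This is my own comparison sketch, not project code: the explicit version manages Thread objects by hand, while the TPL version hands the scheduling to the runtime's thread pool.

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class AbstractionComparison
{
    // Explicit threading: the programmer creates, starts and joins every
    // thread, and must capture the loop index carefully.
    public static int[] WithThreads(int[] data)
    {
        var result = new int[data.Length];
        var threads = new Thread[data.Length];
        for (int i = 0; i < data.Length; i++)
        {
            int idx = i;                     // stable copy of i per thread
            threads[i] = new Thread(() => result[idx] = data[idx] * 2);
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        return result;
    }

    // Implicit threading via the TPL: the runtime schedules the iterations
    // onto pooled threads; no thread lifetime management in user code.
    public static int[] WithTpl(int[] data)
    {
        var result = new int[data.Length];
        Parallel.For(0, data.Length, i => result[i] = data[i] * 2);
        return result;
    }
}
```

Beyond brevity, the implicit version also avoids the one-thread-per-iteration cost of the explicit sketch, which is why the TPL was the better fit for this project's loop-heavy hotspots.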

Considerations and Project Hurdles


Over the course of this project there were a number of obstacles that stood in the way of
running the program in parallel.

The most prevalent issue was dealing with data dependencies and critical sections.
Often the most immediate remedy was to ensure threads only executed thread-safe code.
This prevents threads from disrupting the code's intended functionality and was mainly
managed using some form of lock. If that failed, the next option I would try was to rebuild the
offending code to better expose any potential parallelism.
The primary occurrence of this for me was during my attempts at rewriting the program's Fast
Fourier transform from recursive to iterative.

Even where inherent parallelism was available, I still found myself running the risk of making
code run even slower than before. The predominant culprit was excessive overhead: when
parallelising trivial loops, the overhead of parallelisation ended up hindering performance
instead. As a rule of thumb, I dealt with overhead by prioritising outer loops where possible.
Reflection
Overall, parallelising the Digital Music Analysis program proved to be an exercise in
patience and thinking outside the box. While I couldn't figure out how to implement all of the
parallelism I had planned, I still feel that this assignment ended up being more of a learning
experience than just a test of my competence with parallel programming.
Quite unique to this project, I spent a significant amount of time not having a clue how to
progress. To its credit, however, I was often forced to dig deep and think of unorthodox answers
to obstacles.

While results-wise my parallel solution wasn't very successful, achieving only an estimated 3%
CPU usage reduction and failing to parallelise the fft function, I'm personally content, having
vastly improved my understanding of how you would prepare a program to be parallelised. I
think that ideally you could parallelise both fft functions, decreasing CPU usage by at least
50%. I felt I was very close to figuring it out; perhaps the version of FFT I was using in the end,
with its three nested for loops, made it too difficult to parallelise the outer loop without
significantly reworking it again or running into more dependency issues.

Since I was completely unfamiliar with audio programming going into this project, I didn't know
which parts of the original program were unique to the Digital Music Analysis application and
which were universal functions for manipulating audio data.
Because of this lack of prior research, I ended up spending far too long unknowingly trying to
manually recreate FFT, a very complex algorithm that was otherwise well documented. So if I
had the opportunity to continue working on this solution, I would attempt to rework the fft
function again and explore more parallelisation methods. Maybe explicit threads, such as the
.NET Thread class, would be more suitable.

Although objectively I only managed to pull off a fraction of the total potential parallelism for this
program, I'm still satisfied with my effort in really trying to accomplish something I wasn't
remotely confident in.

References
1. Remmel M. Intunator App [Internet]. Intunator.com. 2020 [cited 23 October 2020]. Available
from: https://www.intunator.com/en

2. Fast Fourier transform - Rosetta Code [Internet]. Rosettacode.org. 2020 [cited 23 October
2020]. Available from: https://rosettacode.org/wiki/Fast_Fourier_transform

Thank you for marking my report,


Aden
