#+TITLE: Parallel and Distributed Simulation of Large-Scale Distributed Applications
#+AUTHOR: Ezequiel Torti Lopez, Martin Quinson
#+EMAIL: ret0110@famaf.unc.edu.ar, martin.quinson@loria.fr
#+STARTUP: indent hideblocks
#+TAGS: noexport(n)
#+EXPORT_SELECT_TAGS: export
#+EXPORT_EXCLUDE_TAGS: noexport
#+PROPERTY: session *R*
#+LATEX_class: sigalt
#+LATEX_HEADER: \usepackage[T1]{fontenc}
#+LATEX_HEADER: \usepackage[utf8]{inputenc}
#+LATEX_HEADER: \usepackage{ifthen,figlatex}
#+LATEX_HEADER: \usepackage{longtable}
#+LATEX_HEADER: \usepackage{float}
#+LATEX_HEADER: \usepackage{wrapfig}
#+LATEX_HEADER: \usepackage{subfigure}
#+LATEX_HEADER: \usepackage{xspace}
#+LATEX_HEADER: \usepackage[american]{babel}
#+LATEX_HEADER: \usepackage{url}\urlstyle{sf}
#+LATEX_HEADER: \usepackage{amscd}
#+LATEX_HEADER: \usepackage{wrapfig}
#+LATEX_HEADER: \usepackage{algorithm}
#+LATEX_HEADER: \usepackage[noend]{algpseudocode}
#+LATEX_HEADER: \renewcommand{\algorithmiccomment}[1]{// #1}
#+LATEX_HEADER: \makeatletter
#+LATEX_HEADER: \addto\captionsenglish{\renewcommand{\ALG@name}{Heuristic}}
#+LATEX_HEADER: \makeatother
#+LATEX_HEADER: \usepackage{caption}
#+LATEX_HEADER: \DeclareCaptionLabelFormat{alglabel}{\bfseries\csname ALG@name\endcsname:}
#+LATEX_HEADER: \captionsetup[algorithm]{labelformat=alglabel}
* Motivation and Problem Statement
Simulation is the third pillar of science, allowing the study of complicated
phenomena through complex models. When the size or complexity of the studied
models becomes too large, it is classical to leverage more resources through
Parallel Discrete-Event Simulation (PDES).
Still, the parallel simulation of very fine-grained applications deployed on
large-scale distributed systems (LSDS) remains challenging. As a matter of fact,
most simulators of Peer-to-Peer systems are sequential, despite the vast
literature on PDES over the last three decades.
dPeerSim is one of the very few existing PDES of P2P systems, but its
performance is disappointing: it achieves a decent speedup when increasing the
number of logical processes (LPs), from 4h with 2 LPs down to 1h with 16 LPs,
but it remains vastly inefficient compared to the sequential version of
PeerSim, which performs the same experiment in only 50 seconds. This calls for
a new parallelization scheme specifically tailored to this category of
Discrete-Event Simulators.
Discrete-Event Simulation of distributed applications classically alternates
between simulation phases, where the models compute the next event date, and
phases where the application workload is executed. In~\cite{previous}, we
proposed not to split the simulation model across several computing nodes, but
instead to keep the model sequential and execute the application workload in
parallel when possible. We hypothesized that this would help reduce the
synchronization costs. We evaluated our contribution with very fine-grained
workloads such as P2P protocols. These workloads are the most difficult to
execute efficiently in parallel because their execution times are very short,
making it very difficult to amortize the synchronization times.
We implemented this parallel scheme within the SimGrid framework \cite{simgrid}, and showed
that the extra complexity does not endanger performance: the sequential
version of SimGrid still outperforms several competing solutions when
our additions are present but disabled at run time.
To the best of our knowledge, this is the first time that a parallel simulation
of a P2P system proves to be faster than the best known sequential execution.
Yet, the parallel simulation only outperforms the sequential one when the
number of processes becomes large enough. This follows from the pigeonhole
principle: when the number of processes increases, the average number of
processes that are ready to run at each simulated timestamp (and can thus run
in parallel) increases as well. When simulating the Chord protocol, it takes
500,000 processes or more to amortize the synchronization costs, while the
classical studies of the literature usually involve fewer processes.
The current work aims at further improving the performance of our PDES, using
several P2P protocols as workloads. We investigate the remaining inefficiencies
and propose generic solutions that could be included in other simulators of
large-scale distributed systems, be they P2P, cloud, HPC or sensor-network
simulators.
This paper is organized as follows: Section \ref{sec:context} recaps
the SimGrid architecture and briefly presents the parallel execution
scheme detailed in \cite{previous}. Section \ref{sec:problem} analyzes
the current performance of our model. Section \ref{sec:parallel}
explores several trade-offs for the efficiency of the parallel
sections. Section \ref{sec:adaptive} proposes an algorithm to
automatically tune the level of parallelism to the simulated
application. Section \ref{sec:cc} concludes this paper and discusses
future work.
# the theoretical performance bound, and discusses the previous work at the light of the Amhdal law
* Context
#+LaTeX: \label{sec:context}
In previous work \cite{previous}, we proposed to parallelize the
execution of the user code while keeping the simulation engine
sequential. This is enabled by applying classical concepts of OS
design to this new context: every interaction between the user
processes (from now on, 'user processes' and 'processes' are used
interchangeably) and the simulated environment passes through a
specific layer that acts as an OS kernel.
A novel way to virtualize user processes (\emph{raw contexts}) was
crafted to improve efficiency and avoid unnecessary system calls;
other mechanisms remain available for the sake of portability, such
as full-featured threads or POSIX \emph{ucontexts}. A new data structure to
store the shared state of the system and synchronize the process
execution was implemented as well (\emph{parmap}).
The new layer acting as the OS kernel was implemented in SimGrid to
emulate system calls, called \emph{requests}. Each time a process
wants to interact with another process, or with the engine itself, it
raises a \emph{request}. After what is called a Scheduling Round (SR),
all the active processes have raised their request and wait for a
response, or have finished their work. Then the engine takes control
of the program and answers the \emph{requests} of each process
sequentially. This way, the user processes can be executed in parallel
in a safe manner. More details on the execution model can be found in
\cite{previous} and on the official website of SimGrid
\cite{simgrid}.
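To make this execution model concrete, the following sketch mimics the
loop in Python. It is an illustration only: the \texttt{Process} class, its
generator-based user code and the thread pool are hypothetical
stand-ins, not SimGrid's actual implementation (which relies on raw
contexts and the parmap abstraction).
#+begin_src python
from concurrent.futures import ThreadPoolExecutor

class Process:
    """Hypothetical stand-in for a simulated user process."""
    def __init__(self, code):
        self.code = code              # generator: yields one request per SR
        self.alive = True
        self.pending_request = None

    def run_until_request(self):
        # The user code runs until it raises its next request (or terminates)
        try:
            self.pending_request = next(self.code)
        except StopIteration:
            self.alive = False
            self.pending_request = None

def run_simulation(processes, nthreads=4, threshold=2):
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        while any(p.alive for p in processes):
            ready = [p for p in processes if p.alive]
            # Scheduling round: the user code of the ready processes may run in parallel
            if len(ready) >= threshold:
                list(pool.map(Process.run_until_request, ready))
            else:
                for p in ready:
                    p.run_until_request()
            # Maestro then answers the raised requests one by one, sequentially,
            # so the shared simulated state is never accessed concurrently
            for p in ready:
                if p.pending_request is not None:
                    p.pending_request = None  # the engine would update the state here

# Ten toy processes, each raising three dummy 'send' requests before finishing
run_simulation([Process(iter(["send"] * 3)) for _ in range(10)])
#+end_src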
Experimental results showed that the new design does not hinder the
tool's scalability; even the sequential version is more scalable
than state-of-the-art simulators. The difficulty of getting a parallel
version of a P2P simulator to run faster than its sequential counterpart
was also highlighted in \cite{previous}, which reported the first time
that a parallel simulation of Chord ran faster than the best known
sequential implementation.
Another interesting result shown in the previous work is that the
speedups only increase up to a certain point when increasing the
number of worker threads. We also showed that for small
instances, parallelism actually hinders performance, and that the
relative gain of parallelism appears to increase strictly with the
system size.
We are now close to the bound predicted by Amdahl's law, which means
that we have reached a limit on the parallelizable portions of the code
in our proposed model. The remaining optimizations look for a final
speedup: tuning the parallel threshold dynamically depending on the
simulation, and improving thread performance by taking into account
their distribution over the CPU cores and the different
synchronization modes (futexes, POSIX primitives or busy waiting).
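As a reminder, Amdahl's law bounds the achievable speedup by the
fraction of the work that remains sequential. A minimal sketch (the 60%
parallel fraction below is an arbitrary illustration, not a measured
value):
#+begin_src python
def amdahl_speedup(parallel_fraction, nthreads):
    """Upper bound on the speedup when only part of each run uses the threads."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / nthreads)

# Even with 8 worker threads, a 60% parallel fraction caps the speedup around 2.1x
print(amdahl_speedup(0.6, 8))   # ~2.11
#+end_src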
All the experiments were run using the facilities provided by
Grid'5000 \cite{g5k}.
* Performance Analysis
#+LaTeX: \label{sec:problem}
** Current speedup achieved
# Also, the benchmarking not intrusive is here.
To get baseline timings and a speedup plot for the development version
of SimGrid (3.12), we ran benchmarks measuring the execution time in
Precise mode with different numbers of threads (1, 2, 4, 8, 16 and 24).
For this we used an implementation of the well-known Chord protocol
\cite{chord} as the workload.
The absolute times of a normal execution of the Chord simulation are
presented in Table \ref{tab:one}.
#+caption: Execution times (h:mm:ss) of a normal execution of Chord with different sizes: serial, and with 2 and 8 threads. The average memory consumption is reported in GB.
#+name: tab:one
|---+-------+---------+-------+---------+-------+---------+-------|
| | nodes | serial | Mem | 2 thr. | Mem. | 8 thr. | Mem. |
| / | <> | < | > | < | > | < | > |
|---+-------+---------+-------+---------+-------+---------+-------|
| # | 10k | 0:01:03 | 0.25 | 0:01:20 | 0.26 | 0:01:35 | 0.25 |
| # | 50k | 0:06:20 | 1.24 | 0:07:39 | 1.27 | 0:08:03 | 1.25 |
| # | 100k | 0:13:34 | 2.47 | 0:15:36 | 2.53 | 0:15:50 | 2.50 |
| # | 300k | 0:50:58 | 7.38 | 0:55:18 | 7.54 | 0:57:55 | 7.47 |
| # | 500k | 1:38:16 | 12.30 | 1:34:15 | 12.47 | 1:35:10 | 12.45 |
| # | 1m | 4:05:41 | 24.53 | 4:00:42 | 24.89 | 3:47:28 | 24.91 |
|---+-------+---------+-------+---------+-------+---------+-------|
As can be seen in Figure \ref{fig:one.one}, the memory consumption
increases linearly with the number of simulated nodes, each node using
between 25 KB and 30 KB of memory. A simulation with 1,000 nodes has a
peak memory consumption of around 30 MB (regardless of the number of
threads launched) and finishes in 4 seconds in a serial execution,
while one with 1,000,000 nodes takes 24-25 GB of memory and 3h47m to
finish in the best case (parallel execution with 8 threads).
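As a quick sanity check of the per-node footprint, one can divide the
memory column of Table \ref{tab:one} by the number of simulated nodes
(serial-run values copied from the table), which gives roughly 25-26
KiB per node:
#+begin_src python
# Back-of-the-envelope check: memory (GB, serial runs) per simulated node
for nodes, gb in ((10000, 0.25), (100000, 2.47), (1000000, 24.53)):
    print("%8d nodes: %.1f KiB per node" % (nodes, gb * 2**30 / nodes / 1024))
#+end_src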
#+attr_latex: width=0.8\textwidth,placement=[p]
#+label: fig:one.one
#+caption: Memory consumption, reported in GB
[[file:fig/memory-consumption.pdf]]
The actual speedup obtained can be seen in Figure \ref{fig:one}.
It is clear from that graph that a real speedup with our parallel
model is only obtained when the size of the problem exceeds 300,000
nodes. This confirms what was shown in \cite{previous}.
Figure \ref{fig:one} also shows that increasing the number of threads
may not be the best option to increase performance, since the best
speedups are achieved with 2, 4 and 8 threads. Some of the
optimizations proposed in Section \ref{sec:parallel} show improvements
over the original versions with 16 and 24 threads, but their total
times still lag behind those of the same simulations with fewer
threads.
#+attr_latex: width=0.8\textwidth,placement=[p]
#+label: fig:one
#+caption: Baseline performance of SimGrid 3.12. Speedups of the multithreaded executions relative to the sequential ones.
[[file:fig/baseline-perf.pdf]]
# We want to see now is how far are we from the ideal speedup that
# would be achieved according to the Amdahl law. For that, a benchmark
# test is run to get the timings of the sequential and parallel parts of
# the executions, and the calculate that speedup using the Amdahl
# equation.
# But first we want to prove that our benchmarks are not intrusive, that
# is, our measures of parallel and sequential times do not really affect
# the overall performance of the system. For that, the experiments are
# run with and without benchmarking, using the Precise mode, and then a
# comparison of both is made to find if there is a significative breach
# in the timings of both experiments.
# Using the Chord simulation, the experiment showed us that the maximum
# difference in the execution time of both versions is lesser than 10%
# in most of the cases, and is even lower with sizes bigger than 100000
# nodes, which allow us to conclude the benchmarking is, indeed, not
# intrusive.
** Parallelizable portions of the problem
We want to analyze each SR and find any possible performance problem
here, since is the portion of code that is run in parallel in our
model. Using the same Chord implementation as workload, we want to
gather the following data: ID of each Scheduling Round, time taken by
each Scheduling Round and number of process executed in each
scheduling round.
As can be seen in Figure \ref{fig:two}, the proportion of SRs
having just one process varies between 26% and 48% (the larger the
simulated size, the lower the proportion of SRs with only one
process), while the others involve two or more processes. These
remaining SRs are executed in parallel according to the parallel
execution threshold already set in SimGrid (which can be modified
through a parameter).
#+attr_latex: width=0.8\textwidth,placement=[p]
#+label: fig:two
#+caption: Proportions of SRs with different numbers of processes to compute, according to the number of simulated nodes.
[[file:fig/sr-distribution.pdf]]
However, launching a small number of processes in parallel is
inefficient because of the thread synchronization costs. Even though
Figure \ref{fig:three} shows that the larger the number of processes
in an SR, the longer its execution time, no speedup is obtained from
executing small numbers of processes in parallel, as we will see in
Section \ref{sec:adaptive}. Hence, it would be convenient to know,
during a simulation, when to launch an SR in parallel and when to run
it sequentially. A heuristic to accomplish this is proposed later in
this document.
#+attr_latex: width=0.8\textwidth,placement=[p]
#+label: fig:three
#+caption: Average times of sequential executions of SRs, depending on the number of processes in each SR.
[[file:fig/sr-times.pdf]]
* Optimizations
#+LaTeX: \label{sec:parallel}
** Binding threads to physical cores
Regarding the multicore architectures (like almost every modern CPU),
parallelization through threads is well proved to be a good choice if
done correctly. This approach, used currently by SimGrid, showed a
good gain in speed with bigger sizes, as we said in Section
\cite{sec:problem} But there are still improvements that might reduce
the noise and the overhead that inherently comes with threads.
Thread execution depends heavily on the operating system scheduler:
when one thread is \emph{idle}, the scheduler may decide to switch it
for another thread that is ready to work, so as to maximize the
occupancy of the CPU cores and, probably, run the program faster. Or it
may simply switch threads because their time quantum is over. When the
first thread is ready to work again, the CPU core where it was running
before might be occupied, forcing the system to run the thread on
another core.
Regardless of the situation, or the scheduler in use, the general
problem remains: increasing the CPU migrations of threads can be
detrimental to performance.
In order to avoid the CPU migrations produced by constant context
switching of threads, GLibc \cite{glibc} offers a way to bind each thread to a
physical core of the CPU. Note that this is only available on Linux
platforms.
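For illustration, the same idea can be expressed in a few lines. A C
implementation would typically rely on glibc's
\texttt{pthread\_setaffinity\_np}; the sketch below uses Python's
\texttt{os.sched\_setaffinity} (Linux only) and only shows the concept, not
SimGrid's actual code.
#+begin_src python
import os
import threading

def pin_to_core(core_id):
    # With pid 0, the affinity mask applies to the calling thread (Linux only)
    os.sched_setaffinity(0, {core_id})

def worker(core_id):
    pin_to_core(core_id)
    # ... the scheduling-round work would run here, without being migrated ...

ncores = os.cpu_count()
threads = [threading.Thread(target=worker, args=(i % ncores,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
#+end_src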
A Chord simulation was run on a parapluie node with 24 cores, binding
the threads to physical cores. CPU migrations were drastically
reduced (almost 97\% fewer migrations) in all cases, but the
relative speedup was not significant: always lower than 1.5x,
regardless of the number of threads or the simulated sizes. However,
the larger speedups were obtained with sizes below 100,000 nodes,
which allows us to conclude that CPU migrations should be avoided when
the simulation is small, since they introduce unwanted overhead.
** Parmap between N cores
Several optimizations regarding the distribution of work between
threads were proposed: the first option is the default one, where
maestro works with its threads and the processes are distributed
equitably between each thread; the second one is to send maestro to
sleep and let the worker threads do all the computing; the last one
involves the creation of one extra thread and make all this N threads
work while maestro sleeps.
The experiments showed that no performance gain was achieved. In fact,
the creation of one extra thread proved to be slower than the original
version of parmap, while sending maestro to sleep and make its N-1
threads do the computation did not show any improvement or loss in
performance.
** Busy Waiting versus Futexes
SimGrid provides several types of synchronization between threads:
Fast Userspace Mutex (futex), the classical POSIX synchronization
primitives and busy waiters. While each of them can be chosen when
running the simulation, futexes are the default option, since they
have the advantage to implement a fast synchronization mode within the
parmap abstraction, in user space only. But even when they are more
efficient than classical mutexes (which run in kernel space), they may
present performance drawbacks that inherently come with
synchronization costs. In this section we compare busy waiters
and futexes performances, using the Chord example.
As it can be seen in Figure \ref{fig:four}, the gain in speed is
immediate with small sizes: the elimination of any synchronization
call makes the simulation run up to 2 times faster. However, we can
see the performance drop and match the one achieved with futexes with
bigger sizes.
#+attr_latex: width=0.8\textwidth,placement=[p]
#+label: fig:four
#+caption: Relative speedup of busy waiters vs. futexes in Chord simulation.
[[file:fig/busy.pdf]]
* Optimal threshold for parallel execution
#+LaTeX: \label{sec:adaptive}
** Getting a real threshold over simulations
The threshold we are looking for is the number of processes above
which an SR is worth executing in parallel, and below which it is
better to execute it sequentially. Initially, we want to find an
optimal threshold to use at the beginning of any simulation. For that
purpose, we benchmarked the execution time of each SR for both
parallel and serial executions, and calculated the speedup obtained in
each SR.
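These per-SR speedups can be computed directly from the per-SR logs
produced by the \texttt{testall\_sr} script given in the appendix (one line
per SR: id, time, number of processes). A small sketch, assuming one
sequential and one parallel log whose three columns parse as numbers
(the file names are hypothetical examples of the script's naming
convention):
#+begin_src python
import csv

def load_sr_times(path):
    """Per-SR log as written by testall_sr: '#id_sr  time_taken  amount_processes'."""
    data = {}
    with open(path) as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue
            sr_id, elapsed, nproc = int(row[0]), float(row[1]), int(row[2])
            data[sr_id] = (elapsed, nproc)
    return data

seq = load_sr_times("sr_10000_threads1_precise.log")
par = load_sr_times("sr_10000_threads4_precise.log")
# Speedup of each SR, together with its number of processes
speedups = {i: (seq[i][0] / par[i][0], seq[i][1])
            for i in seq if i in par and par[i][0] > 0}
#+end_src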
A typical run using Chord with 10,000 nodes shows that above 500
processes per SR, the speedup is always greater than one. It is
interesting to note that a similar limit is reached even for
simulations of different sizes. A thorough analysis of the data tells
us that 83\% of the SRs with between 250 and 300 processes show a
speedup. Consequently, 250 processes will be our base threshold for
parallel execution, and the adaptive algorithm proposed in the next
section will be in charge of increasing or decreasing that threshold
according to the needs and characteristics of the simulation.
Figure \ref{fig:five} shows the example with 10,000 simulated nodes.
Although it seems that a fair number of SRs with fewer than 250
processes are faster in parallel, they represent only 5\% of that
subset of SRs; the remaining 95\% of the SRs with fewer than 500
processes showed a speedup equal to or lower than 1.
#+attr_latex: width=0.8\textwidth,placement=[p]
#+label: fig:five
#+caption: Speedup of parallel vs. sequential executions of SRs, depending on the number of processes handled by each SR.
[[file:fig/sr-par-threshold_10000.png]]
** Dynamic estimation of the optimal threshold
Finding an optimal threshold and keep it during all the simulation
might not always be the best option: some simulations can take more or
less time in the execution of user processes. If a simulation has very
efficient processes, or processes that don't work too much, then the
threshold could be inappropriate, leading to parallelize scheduling
rounds that would run more efficiently in a sequential way. That's
why an heuristic for a dynamic threshold estimation is proposed.
The main idea behind this heuristic (Heuristic \ref{adaptive-algorithm})
is to compute, during the execution of the simulation, the optimal
number of processes above which an SR should be run in parallel.
For that purpose, the time of a certain number of scheduling rounds is
measured. A performance ratio for the parallel and sequential
executions is calculated by simply dividing the time taken by the
number of processes computed. If the sequential ratio turns out to be
larger than the parallel one, the threshold is decreased; otherwise it
is increased.
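A minimal sketch of this rule, with hypothetical bookkeeping variables
and an arbitrary adjustment step (the complete version, including the
windows described below, is given in Heuristic \ref{adaptive-algorithm}):
#+begin_src python
def update_threshold(threshold, seq_time, seq_procs, par_time, par_procs, step=25):
    """Naive rule: compare the time per process of recent sequential and parallel SRs."""
    ratio_seq = seq_time / seq_procs      # seconds per process, sequential SRs
    ratio_par = par_time / par_procs      # seconds per process, parallel SRs
    if ratio_seq > ratio_par:
        return max(1, threshold - step)   # parallel pays off: lower the bar
    return threshold + step               # sequential pays off: raise the bar

threshold = 250   # base threshold estimated in the previous section
threshold = update_threshold(threshold, seq_time=0.8, seq_procs=400,
                             par_time=0.5, par_procs=600)
#+end_src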
A naive implementation of this heuristic showed a small relative
improvement in performance. The times were certainly reduced for small
sizes, since it chooses to execute the majority of the processes
sequentially, while for larger sizes (more than 100,000 nodes) the
speedup is insignificant. In terms of absolute times, the execution
times were slightly reduced (up to ten minutes less for a one-million-node
simulation in the best case, with 8 threads).
These improvements may be small because the ratio is computed from the
times of the latest SRs only, and therefore from values that may not
represent the general situation.
A new approach, using a cumulative ratio (computed over the whole
simulation) instead of one computed from the latest values, proved
better in terms of performance. This approach also changes the way we
do the timings: instead of benchmarking every SR, we benchmark only
the SRs whose number of processes falls between a lower limit for
sequential execution and an upper limit for parallel execution. This
prevents the timing of extreme cases (very large or very small numbers
of processes), which may introduce errors in the estimation of the
threshold, and acts like a 'window' that filters the cases we are
interested in.
These limits are computed along the simulation from the average number
of processes run so far in parallel (or sequentially), plus or minus
the standard deviation. Since we never know beforehand how many SRs we
will have, the average and the standard deviation are computed online
using Welford's algorithm \cite{acsvar,csacsmv}.
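Welford's algorithm keeps a running mean and variance in constant
memory, with one cheap update per SR. A small sketch (the process
counts at the end are made up):
#+begin_src python
class RunningStats:
    """Welford's online algorithm for the mean and standard deviation."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
    def std(self):
        return (self.m2 / self.n) ** 0.5 if self.n > 1 else 0.0

par_procs = RunningStats()
for count in (260, 410, 380, 295):   # process counts of recent parallel SRs
    par_procs.add(count)
# Upper window limit for benchmarking parallel SRs, as described above
par_window = par_procs.mean + par_procs.std()
#+end_src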
Once a prefixed number of parallel and sequential SRs have been run,
we update the threshold by applying a similar rule of thumb:
if the sequential executions were better and the current SR has more
processes than the corresponding average, we increase the threshold,
giving the serial executions a chance to prove they are better.
Conversely, if the parallel executions performed better and the number
of processes of the current SR is smaller than the average, we
decrease the threshold.
This new implementation proved to be faster than the original parallel
version for sizes under 300,000 nodes, while for larger numbers of
nodes the speedup remains almost the same. It also avoids increasing
the threshold to unrealistic values (which may happen in the naive
version, because many SRs with a small number of processes are
computed sequentially and the threshold is increased each time a
sequential execution performs better than a parallel one).
All the experiments were performed with the initial threshold set to
250 processes, which was estimated as an optimal starting threshold in
the previous section. The heuristic leads to different final thresholds
depending on the initial one, of course, since the SRs launched in
parallel will not be the same from the beginning. However, experiments
showed that it behaves quite stably, and within a given simulation
there is a consistent tendency to increase or decrease the threshold
regardless of its initial value.
#+begin_latex
\begin{algorithm}
\caption{Adaptive Threshold}\label{adaptive-algorithm}
\begin{algorithmic}
\State
\Comment {Amount of parallel/sequential SRs that ran}
\State $parallel\_SRs, sequential\_SRs \gets \textit{1}$
\State
\Comment {Sum of times of par/seq SR's}
\State $seq\_time, par\_time \gets \textit{0}$
\State
\Comment {Number of processes computed in par/seq}
\State $process\_seq, process\_par \gets \textit{0}$
\State
\Comment {Average amount of processes parallel/sequential}
\State avg\_par\_proc, avg\_seq\_proc
\State
\Comment {Standard deviation of processes parallel/sequential}
\State sd\_seq\_proc, sd\_par\_proc
\State
\Procedure{RunSchedulingRound}{}
\If {computed five par/seq SR's}
\State $ratio\_seq \gets seq\_time/process\_seq$
\State $ratio\_par \gets par\_time/process\_par$
\State $sequential\_is\_slower \gets ratio\_seq>ratio\_par$
\If {$sequential\_is\_slower$}
\If {$processes\_to\_run < avg\_par\_proc$}
\State decrease($parallel\_threshold$)
\EndIf
\Else
\If {$processes\_to\_run > avg\_seq\_proc$}
\State increase($parallel\_threshold$)
\EndIf
\EndIf
\EndIf
\State
\If {$processes\_to\_run >= parallel\_threshold$}
\If {$processes\_to\_run < par\_window$}
\State $parallel\_SRs++$
\State start($timer$)
\State execute\_SR\_parallel()
\State stop($timer$)
\State $par\_time \gets par\_time + $elapsed($timer$)
\State $process\_par \gets process\_par + processes\_to\_run$
\State $avg\_par\_proc \gets $calculate\_current\_avg\_of\_par\_processes()
\State $sd\_par\_proc \gets $calculate\_current\_sd\_of\_par\_processes()
\State $par\_window \gets avg\_par\_proc + sd\_par\_proc$
\Else
\State execute\_SR\_parallel()
\EndIf
\Else
\If {$processes\_to\_run < seq\_window$}
\State $sequential\_SRs++$
\State start($timer$)
\State execute\_SR\_serial()
\State stop($timer$)
\State $seq\_time \gets seq\_time + $elapsed($timer$)
\State $process\_seq \gets process\_seq + processes\_to\_run$
\State $avg\_seq\_proc \gets $calculate\_current\_avg\_of\_seq\_processes()
\State $sd\_seq\_proc \gets $calculate\_current\_sd\_of\_seq\_processes()
\State $seq\_window \gets avg\_seq\_proc - sd\_seq\_proc$
\Else
\State execute\_SR\_serial()
\EndIf
\EndIf
\EndProcedure
\end{algorithmic}
\end{algorithm}
#+end_latex
#+attr_latex: width=0.8\textwidth,placement=[p]
#+label: fig:six
#+caption: Speedups achieved with the adaptive threshold heuristic (Chord simulation).
[[file:fig/adapt-algorithm.pdf]]
Regarding memory consumption, the values remain essentially the same,
as can be seen in Table \ref{tab:two}.
#+caption: Execution times (h:mm:ss) of Chord with the adaptive threshold heuristic, with 2, 4 and 8 threads. The average memory consumption is reported in GB.
#+name: tab:two
|---+-------+---------+-------+---------+-------+---------+-------|
| | nodes | 2 thr. | Mem | 4 thr. | Mem | 8 thr. | Mem |
| / | <> | < | > | < | > | < | > |
|---+-------+---------+-------+---------+-------+---------+-------|
| # | 10k | 0:01:19 | 0.26 | 0:01:20 | 0.26 | 0:01:27 | 0.25 |
| # | 50k | 0:07:21 | 1.27 | 0:07:28 | 1.27 | 0:07:30 | 1.26 |
| # | 100k | 0:15:16 | 2.53 | 0:15:04 | 2.55 | 0:14:48 | 2.51 |
| # | 300k | 0:54:48 | 7.55 | 0:54:05 | 7.52 | 0:53:44 | 7.48 |
| # | 500k | 1:38:52 | 12.47 | 1:35:19 | 12.56 | 1:31:50 | 12.45 |
| # | 1m | 3:59:12 | 24.89 | 3:47:22 | 25.19 | 3:37:12 | 24.91 |
|---+-------+---------+-------+---------+-------+---------+-------|
* Conclusion
#+LaTeX: \label{sec:cc}
We have shown in this work several ways to optimize large-scale
distributed simulations within a specific framework, namely binding
threads to physical cores, choosing a better threshold for parallel
execution, and choosing between different synchronization modes
between threads. The optimizations were applied to the open-source,
multi-purpose SimGrid simulation framework, in its development version
(3.12). Some of the proposed changes worked better in some scenarios
than in others (for instance, binding threads to cores showed a real
speedup in simulations using larger numbers of threads, such as 16 or
24, while busy waiting proved to be better than futexes in simulations
with small sizes and few threads). Also, some of the modifications did
not affect the overall performance, or even made it worse, like the
parmap changes proposed in Section \ref{sec:parallel}.
Most of the proposed changes improved performance for small simulations
(under 300,000 nodes), but the performance remained almost the same
for larger ones, showing the difficulty of optimizing a complex
multi-threaded system.
We have certainly arrived at a point where optimization depends heavily
on reducing the synchronization costs and working with low-level
features of the code. An intelligent choice of when to launch processes
in parallel and when to do it serially proved to help in small cases,
but it was unnecessary for bigger ones, where a speedup is already
achieved by using threads to simulate the user processes.
As a final note, the present work was done with the reproducible
research approach in mind. Hence, the steps and scripts needed to run
the experiments can be found in the appendix.
* Acknowledgments
Experiments presented in this paper were carried out using the Grid'5000
experimental testbed, being developed under the INRIA ALADDIN development
action with support from CNRS, RENATER and several Universities as well
as other funding bodies (see https://www.grid5000.fr).
#+LaTeX: \bibliographystyle{abbrv}
#+LaTex: \bibliography{report}
#+LaTeX: \onecolumn
#+LaTeX: \appendix
* Data Provenance
This section explains and shows how to run the experiments, and how the
data is saved and then processed. Note that all experiments are run
using the Chord simulation that can be found in the \texttt{examples/msg/chord}
folder of your SimGrid install. Unless otherwise stated, all the
experiments are run using the futex synchronization method and raw
contexts under a Linux environment, on a 'parapluie' node of Grid'5000.
The data analysis can be done within this paper itself, by executing
the corresponding R code blocks. Note that it is even possible to execute
them remotely if TRAMP is used to open this file (this is useful if
you want the data to be processed on a powerful machine, such as a
cluster node).
** Modifiable Parameters
Some of the parameters of the experiments can be modified, like
the number of nodes to simulate and the number of threads to use.
Note that the list of nodes to simulate has to be changed in both the
Python session and the shell session. These sessions are intended to
last for the whole of your experiments/analysis.
The sizes/threads lists are needed to run the simulations, generate
platform/deployment files, and generate tables after the
experiments. Hence, it is mandatory to run these snippets.
Bash:
#+begin_src sh :session org-sh
BASE_DIR=$PWD
sizes=(1000 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000 55000 60000 65000 70000 75000 80000 85000 90000 95000 100000 300000 500000 1000000)
threads=(1 2 4 8 16 24)
#+end_src
Python:
#+name: set_python_args
#+begin_src python :session
SIZES = [1000]
SIZES += [elem for elem in range(5000,100000,5000)]
SIZES += [100000,300000,500000,1000000]
THREADS = [1, 2, 4, 8, 16, 24]
# All the benchmarks can be done using both modes, but note that this
# paper uses only precise
MODES = ['precise']
nb_bits = 32
end_date = 10000
#+end_src
** Setting up the machine
Install the required packages to compile/run the SimGrid experiments. If you
are on a cluster (such as Grid5000) you can run this file remotely on
a deployed node and still be able to set up your environment. Run these
two code chunks one after the other in order to create folders, install
packages and create the required deployment/platform files.
If the [[setup\_and\_install]] snippet was run before, or everything is
already installed and set up, then check/modify the parameters of the
shell session with the snippets [[check\_args]] and [[go\_to\_chord]].
\texttt{setup\_and\_install}:
#+name: setup_and_install
#+begin_src sh :session org-sh
# Save current directory where the report is
BASE_DIR=$PWD
apt-get update && apt-get install cmake make gcc git libboost-dev libgct++ libpcre3-dev linux-tools gdb liblua5.1-0-dev libdwarf-dev libunwind7-dev valgrind libsigc++
mkdir -p SimGrid deployment platforms logs fig
cd $BASE_DIR/SimGrid/
# Clone latest SimGrid version. You may have to configure proxy settings if you are in a G5K node in order to clone this git repository
git clone https://gforge.inria.fr/git/simgrid/simgrid.git .
SGPATH='/usr/local'
# Save the revision of SimGrid used for the experiment
SGHASH=$(git rev-parse --short HEAD)
cmake -Denable_compile_optimizations=ON -Denable_supernovae=OFF -Denable_compile_warnings=OFF -Denable_debug=OFF -Denable_gtnets=OFF -Denable_jedule=OFF -Denable_latency_bound_tracking=OFF -Denable_lua=OFF -Denable_model-checking=OFF -Denable_smpi=OFF -Denable_tracing=OFF -Denable_documentation=OFF .
make install
cd ../../
#+end_src
\texttt{generate\_platform\_files}:
#+name: generate_platform_files
#+begin_src python :session :results output
# This function generates a specific platform file for the Chord example.
# The XML templates below follow the format used by examples/msg/chord/generate.py;
# adjust them if your SimGrid version expects a different format.
import os
import random

def platform(nb_nodes, nb_bits, end_date):
    max_id = 2 ** nb_bits - 1
    all_ids = [42]
    res = ["<?xml version='1.0'?>\n"
           "<!DOCTYPE platform SYSTEM \"http://simgrid.gforge.inria.fr/simgrid.dtd\">\n"]
    res.append("<!-- %d nodes, %d-bit identifiers, simulation ends at %d -->\n"
               % (nb_nodes, nb_bits, end_date))
    res.append("<platform version=\"3\">\n"
               "  <process host=\"node-0.acme.org\" function=\"node\">"
               "<argument value=\"42\"/><argument value=\"%d\"/></process>\n" % end_date)
    for i in range(1, nb_nodes):
        ok = False
        while not ok:
            my_id = random.randint(0, max_id)
            ok = my_id not in all_ids
        known_id = all_ids[random.randint(0, len(all_ids) - 1)]
        start_date = i * 10
        res.append("  <process host=\"node-%d.acme.org\" function=\"node\">"
                   "<argument value=\"%d\"/><argument value=\"%d\"/>"
                   "<argument value=\"%d\"/><argument value=\"%d\"/></process>\n"
                   % (i, my_id, known_id, start_date, end_date))
        all_ids.append(my_id)
    res.append("</platform>\n")
    res = "".join(res)
    f = open(os.getcwd() + "/platforms/chord%d.xml" % nb_nodes, "w")
    f.write(res)
    f.close()
    return

# This function generates a specific deployment file for the Chord example.
# It assumes that the platform will be a cluster.
def deploy(nb_nodes):
    res = """<?xml version='1.0'?>
<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid.dtd">
<platform version="3">
  <AS id="AS0" routing="Full">
    <cluster id="cluster" prefix="node-" suffix=".acme.org" radical="0-%d"
             power="1000000000" bw="125000000" lat="5E-5"/>
  </AS>
</platform>
""" % (nb_nodes - 1)
    f = open(os.getcwd() + "/deployment/One_cluster_nobb_%d_hosts.xml" % nb_nodes, "w")
    f.write(res)
    f.close()
    return

# Remember that SIZES was defined as a global variable in the first python code chunk in [[Modifiable Parameters]]
for size in SIZES:
    platform(size, nb_bits, end_date)
    deploy(size)
#+end_src
Optional snippets to check arguments and go to chord folder:
\texttt{check\_args}:
#+name: check_args
#+begin_src sh :session org-sh
echo $sizes
echo $threads
echo $BASE_DIR
#sizes=(1000)
#threads=(1 2)
#BASE_DIR=$PWD
echo $sizes
echo $threads
echo $BASE_DIR
#+end_src
\texttt{go\_to\_chord}:
#+name: go_to_chord
#+begin_src sh :session org-sh
cd $BASE_DIR/SimGrid/examples/msg/chord
echo $BASE_DIR
echo $sizes
echo $threads
make
#+end_src
** Scripts to run benchmarks
These are general scripts that can be used to run all the benchmarks
once the proper modifications have been made.
\texttt{testall}:
#+name: testall
#+begin_src sh :var SG_PATH='/usr/local' :var log_folder="logs" :session org-sh
# This script is to benchmark the Chord simulation that can be found
# in examples/msg/chord folder.
# The benchmark can be done with both Constant and Precise mode, using
# different sizes and number of threads (which can be modified).
# This script also generates a table with all the gathered times, compatible
# with gnuplot/R, to ease the plotting.
# For now, this script copies all data (generated logs and final table) to a
# personal frontend node in Grid5000. This should be modified in the near
# future.
###############################################################################
# MODIFIABLE PARAMETERS: SGPATH, SGHASH, sizes, threads, log_folder, file_table
# host_info, timefmt, cp_cmd, dest.
# Path to installation folder needed to recompile chord
# If it is not set, assume that the path is '/usr/local'
if [ -z "$SG_PATH" ]
then
SGPATH='/usr/local'
fi
# Save the revision of SimGrid used for the experiment
SGHASH=$(git rev-parse --short HEAD)
# List of sizes to test. Modify this to add different sizes.
if [ -z "$sizes" ]
then
sizes=(1000 3000)
fi
# Number of threads to test.
if [ -z "$threads" ]
then
threads=(1 2 4 8 16 24)
fi
# Path where to store logs, and filenames of times table, host info
if [ -z "$log_folder" ]
then
log_folder=$BASE_DIR"/logs"
else
log_folder=$BASE_DIR"/logs/"$log_folder
fi
if [ ! -d "$log_folder" ]
then
echo "Creating $log_folder to store logs."
mkdir -p $log_folder
fi
# Copy all the generated deployment/platform files into chord folder
cp $BASE_DIR/platforms/* .
cp $BASE_DIR/deployment/* .
file_table="timings_$SGHASH.csv"
host_info="host_info.org"
rm -rf $host_info
# The last %U is just to ease the parsing for the table
timefmt="clock:%e user:%U sys:%S telapsed:%e swapped:%W exitval:%x max:%Mk avg:%Kk %U"
# Copy command. This way one can use cp, scp and a local folder or a folder in
# a cluster.
sep=','
cp_cmd='cp'
dest=$log_folder"/." # change for @.grid5000.fr:~/$log_folder if necessary
###############################################################################
###############################################################################
echo "Recompile the binary against $SGPATH"
export LD_LIBRARY_PATH="$SGPATH/lib"
rm -rf chord
gcc chord.c -L$SGPATH/lib -I$SGPATH/include -I$SGPATH/src/include -lsimgrid -o chord
if [ ! -e "chord" ]; then
echo "chord does not exist"
exit;
fi
###############################################################################
###############################################################################
# PRINT HOST INFORMATION IN DIFFERENT FILE
set +e
echo "#+TITLE: Chord experiment on $(eval hostname)" >> $host_info
echo "#+DATE: $(eval date)" >> $host_info
echo "#+AUTHOR: $(eval whoami)" >> $host_info
echo " " >> $host_info
echo "* People logged when experiment started:" >> $host_info
who >> $host_info
echo "* Hostname" >> $host_info
hostname >> $host_info
echo "* System information" >> $host_info
uname -a >> $host_info
echo "* CPU info" >> $host_info
cat /proc/cpuinfo >> $host_info
echo "* CPU governor" >> $host_info
if [ -f /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor ];
then
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor >> $host_info
else
echo "Unknown (information not available)" >> $host_info
fi
echo "* CPU frequency" >> $host_info
if [ -f /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq ];
then
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq >> $host_info
else
echo "Unknown (information not available)" >> $host_info
fi
echo "* Meminfo" >> $host_info
cat /proc/meminfo >> $host_info
echo "* Memory hierarchy" >> $host_info
lstopo --of console >> $host_info
echo "* Environment Variables" >> $host_info
printenv >> $host_info
echo "* Tools" >> $host_info
echo "** Linux and gcc versions" >> $host_info
cat /proc/version >> $host_info
echo "** Gcc info" >> $host_info
gcc -v 2>> $host_info
echo "** Make tool" >> $host_info
make -v >> $host_info
echo "** CMake" >> $host_info
cmake --version >> $host_info
echo "* SimGrid Version" >> $host_info
grep "SIMGRID_VERSION_STRING" ../../../include/simgrid_config.h | sed 's/.*"\(.*\)"[^"]*$/\1/' >> $host_info
echo "* SimGrid commit hash" >> $host_info
git rev-parse --short HEAD >> $host_info
$($cp_cmd $host_info $dest)
###############################################################################
###############################################################################
# ECHO TABLE HEADERS INTO FILE_TABLE
rm -rf $file_table
tabs_needed=""
for thread in "${threads[@]}"; do
thread_line=$thread_line"\t"$thread
done
thread_line=$thread_line$thread_line
for size in $(seq 1 $((${#threads[@]}-1))); do
tabs_needed=$tabs_needed"\t"
done
echo "#SimGrid commit $SGHASH" >> $file_table
echo -e "#\t\tconstant${tabs_needed}precise" >> $file_table
echo -e "#size/thread$thread_line" >> $file_table
###############################################################################
###############################################################################
# START SIMULATION
test -e tmp || mkdir tmp
me=tmp/`hostname -s`
for size in "${sizes[@]}"; do
line_table=$size
# CONSTANT MODE
for thread in "${threads[@]}"; do
filename="chord_${size}_threads${thread}_constant.log"
rm -rf $filename
if [ ! -f chord$size.xml ]; then
./generate.py -p -n $size -b 32 -e 10000
fi
if [ ! -f One_cluster_nobb_${size}_hosts.xml ]; then
./generate.py -d -n $size
fi
echo "$size nodes, constant model, $thread threads"
cmd="./chord One_cluster_nobb_"$size"_hosts.xml chord$size.xml --cfg=contexts/stack_size:16 --cfg=network/model:Constant --cfg=network/latency_factor:0.1 --log=root.thres:info --cfg=contexts/nthreads:$thread --cfg=contexts/guard_size:0"
/usr/bin/time -f "$timefmt" -o $me.timings $cmd 1>/tmp/stdout-xp 2>/tmp/stderr-xp
if grep "Command terminated by signal" $me.timings ; then
echo "Error detected:"
temp_time="errSig"
elif grep "Command exited with non-zero status" $me.timings ; then
echo "Error detected:"
temp_time="errNonZero"
else
temp_time=$(cat $me.timings | awk '{print $(NF)}')
fi
# param
cat $host_info >> $filename
echo "* Experiment settings" >> $filename
echo "size:$size, constant network, $thread threads" >> $filename
echo "cmd:$cmd" >> $filename
#stderr
echo "* Stderr output" >> $filename
cat /tmp/stderr-xp >> $filename
# time
echo "* Timings" >> $filename
cat $me.timings >> $filename
line_table=$line_table$sep$temp_time
$($cp_cmd $filename $dest)
rm -rf $filename
rm -rf $me.timings
done
#PRECISE MODE
for thread in "${threads[@]}"; do
echo "$size nodes, precise model, $thread threads"
filename="chord_${size}_threads${thread}_precise.log"
cmd="./chord One_cluster_nobb_"$size"_hosts.xml chord$size.xml --cfg=contexts/stack_size:16 --cfg=maxmin/precision:0.00001 --log=root.thres:info --cfg=contexts/nthreads:$thread --cfg=contexts/guard_size:0"
/usr/bin/time -f "$timefmt" -o $me.timings $cmd 1>/tmp/stdout-xp 2>/tmp/stderr-xp
if grep "Command terminated by signal" $me.timings ; then
echo "Error detected:"
temp_time="errSig"
elif grep "Command exited with non-zero status" $me.timings ; then
echo "Error detected:"
temp_time="errNonZero"
else
temp_time=$(cat $me.timings | awk '{print $(NF)}')
fi
# param
cat $host_info >> $filename
echo "* Experiment settings" >> $filename
echo "size:$size, constant network, $thread threads" >> $filename
echo "cmd:$cmd" >> $filename
#stderr
echo "* Stderr output" >> $filename
cat /tmp/stderr-xp >> $filename
# time
echo "* Timings" >> $filename
cat $me.timings >> $filename
line_table=$line_table$sep$temp_time
$($cp_cmd $filename $dest)
rm -rf $filename
rm -rf $me.timings
done
echo -e $line_table >> $file_table
done
$($cp_cmd $file_table $dest)
rm -rf $file_table
rm -rf tmp
#+end_src
\texttt{testall\_sr}:
#+name: testall_sr
#+begin_src sh :var SG_PATH='/usr/local' :var log_folder="logs" :session org-sh
# This script is to benchmark the Chord simulation that can be found
# in examples/msg/chord folder.
# The benchmark is done with both Constant and Precise mode, using
# different sizes and number of threads (which can be modified).
# This script also generates a table with all the gathered times, compatible
# with gnuplot/R, to ease the plotting.
# For now, this script copies all data (generated logs and final table) to a
# personal frontend node in Grid5000. This should be modified in the near
# future.
###############################################################################
# MODIFIABLE PARAMETERS: SGPATH, SGHASH, sizes, threads, log_folder, file_table
# host_info, timefmt, cp_cmd, dest.
# Path to installation folder needed to recompile chord
# If it is not set, assume that the path is '/usr/local'
if [ -z "$SG_PATH" ]
then
SGPATH='/usr/local'
fi
# Save the revision of SimGrid used for the experiment
SGHASH=$(git rev-parse --short HEAD)
# List of sizes to test. Modify this to add different sizes.
if [ -z "$sizes" ]
then
sizes=(1000 3000)
fi
# Number of threads to test.
if [ -z "$threads" ]
then
threads=(1 2 4 8 16 24)
fi
# Path where to store logs, and filenames of times table, host info
if [ -z "$log_folder" ]
then
log_folder=$BASE_DIR"/logs"
else
log_folder=$BASE_DIR"/logs/"$log_folder
fi
if [ ! -d "$log_folder" ]
then
echo "Creating $log_folder to store logs."
mkdir -p $log_folder
fi
# Copy all the generated deployment/platform files into chord folder
cp $BASE_DIR/platforms/* .
cp $BASE_DIR/deployment/* .
file_table="timings_$SGHASH.csv"
host_info="host_info.org"
rm -rf $host_info
# The last %U is just to ease the parsing for the table
timefmt="clock:%e user:%U sys:%S telapsed:%e swapped:%W exitval:%x max:%Mk avg:%Kk %U"
# Copy command. This way one can use cp, scp and a local folder or a folder in
# a cluster.
sep=','
cp_cmd='cp'
dest=$log_folder # change for @.grid5000.fr:~/$log_folder if necessary
###############################################################################
###############################################################################
echo "Recompile the binary against $SGPATH"
export LD_LIBRARY_PATH="$SGPATH/lib"
rm -rf chord
gcc chord.c -L$SGPATH/lib -I$SGPATH/include -I$SGPATH/src/include -lsimgrid -o chord
if [ ! -e "chord" ]; then
echo "chord does not exist"
exit;
fi
###############################################################################
###############################################################################
# PRINT HOST INFORMATION IN DIFFERENT FILE
set +e
echo "#+TITLE: Chord experiment on $(eval hostname)" >> $host_info
echo "#+DATE: $(eval date)" >> $host_info
echo "#+AUTHOR: $(eval whoami)" >> $host_info
echo " " >> $host_info
echo "* People logged when experiment started:" >> $host_info
who >> $host_info
echo "* Hostname" >> $host_info
hostname >> $host_info
echo "* System information" >> $host_info
uname -a >> $host_info
echo "* CPU info" >> $host_info
cat /proc/cpuinfo >> $host_info
echo "* CPU governor" >> $host_info
if [ -f /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor ];
then
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor >> $host_info
else
echo "Unknown (information not available)" >> $host_info
fi
echo "* CPU frequency" >> $host_info
if [ -f /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq ];
then
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq >> $host_info
else
echo "Unknown (information not available)" >> $host_info
fi
echo "* Meminfo" >> $host_info
cat /proc/meminfo >> $host_info
echo "* Memory hierarchy" >> $host_info
lstopo --of console >> $host_info
echo "* Environment Variables" >> $host_info
printenv >> $host_info
echo "* Tools" >> $host_info
echo "** Linux and gcc versions" >> $host_info
cat /proc/version >> $host_info
echo "** Gcc info" >> $host_info
gcc -v 2>> $host_info
echo "** Make tool" >> $host_info
make -v >> $host_info
echo "** CMake" >> $host_info
cmake --version >> $host_info
echo "* SimGrid Version" >> $host_info
grep "SIMGRID_VERSION_STRING" ../../../include/simgrid_config.h | sed 's/.*"\(.*\)"[^"]*$/\1/' >> $host_info
echo "* SimGrid commit hash" >> $host_info
git rev-parse --short HEAD >> $host_info
$($cp_cmd $host_info $dest)
###############################################################################
###############################################################################
# ECHO TABLE HEADERS INTO FILE_TABLE
rm -rf $file_table
tabs_needed=""
for thread in "${threads[@]}"; do
thread_line=$thread_line"\t"$thread
done
thread_line=$thread_line$thread_line
for size in $(seq 1 $((${#threads[@]}-1))); do
tabs_needed=$tabs_needed"\t"
done
echo "#SimGrid commit $SGHASH" >> $file_table
echo -e "#\t\tconstant${tabs_needed}precise" >> $file_table
echo -e "#size/thread$thread_line" >> $file_table
###############################################################################
###############################################################################
# START SIMULATION
test -e tmp || mkdir tmp
me=tmp/`hostname -s`
for size in "${sizes[@]}"; do
line_table=$size
# CONSTANT MODE
for thread in "${threads[@]}"; do
filename="chord_${size}_threads${thread}_constant.log"
output="sr_${size}_threads${thread}_constant.log"
rm -rf $filename
if [ ! -f chord$size.xml ]; then
./generate.py -p -n $size -b 32 -e 10000
fi
if [ ! -f One_cluster_nobb_${size}_hosts.xml ]; then
./generate.py -d -n $size
fi
echo "$size nodes, constant model, $thread threads"
cmd="./chord One_cluster_nobb_"$size"_hosts.xml chord$size.xml --cfg=contexts/stack_size:16 --cfg=network/model:Constant --cfg=network/latency_factor:0.1 --log=root.thres:critical --cfg=contexts/nthreads:$thread --cfg=contexts/guard_size:0"
/usr/bin/time -f "$timefmt" -o $me.timings $cmd 1>/tmp/stdout-xp 2>/tmp/stderr-xp
if grep "Command terminated by signal" $me.timings ; then
echo "Error detected:"
temp_time="errSig"
elif grep "Command exited with non-zero status" $me.timings ; then
echo "Error detected:"
temp_time="errNonZero"
else
temp_time=$(cat $me.timings | awk '{print $(NF)}')
fi
# param
cat $host_info >> $filename
echo "* Experiment settings" >> $filename
echo "size:$size, constant network, $thread threads" >> $filename
echo "cmd:$cmd" >> $filename
#stdout
echo "* Stdout output" >> $filename
cat /tmp/stdout-xp | grep Amdahl >> $filename
#stderr
echo "* Stderr output" >> $filename
cat /tmp/stderr-xp >> $filename
# time
echo "* Timings" >> $filename
cat $me.timings >> $filename
line_table=$line_table$sep$temp_time
# Gather SR data from logs
echo -e '#id_sr\ttime_taken\tamount_processes' >> $output
grep 'Total time SR' $filename | awk '{print $7 "\x09" $9 "\x09" $10}' | tr -d ',' >> $output
$($cp_cmd $output $dest)
$($cp_cmd $filename $dest)
rm -rf $filename $output
rm -rf $me.timings
done
#PRECISE MODE
for thread in "${threads[@]}"; do
echo "$size nodes, precise model, $thread threads"
filename="chord_${size}_threads${thread}_precise.log"
output="sr_${size}_threads${thread}_precise.log"
cmd="./chord One_cluster_nobb_"$size"_hosts.xml chord$size.xml --cfg=contexts/stack_size:16 --cfg=maxmin/precision:0.00001 --log=root.thres:critical --cfg=contexts/nthreads:$thread --cfg=contexts/guard_size:0"
/usr/bin/time -f "$timefmt" -o $me.timings $cmd 1>/tmp/stdout-xp 2>/tmp/stderr-xp
if grep "Command terminated by signal" $me.timings ; then
echo "Error detected:"
temp_time="errSig"
elif grep "Command exited with non-zero status" $me.timings ; then
echo "Error detected:"
temp_time="errNonZero"
else
temp_time=$(cat $me.timings | awk '{print $(NF)}')
fi
# param
cat $host_info >> $filename
echo "* Experiment settings" >> $filename
echo "size:$size, constant network, $thread threads" >> $filename
echo "cmd:$cmd" >> $filename
#stderr
echo "* Stderr output" >> $filename
cat /tmp/stderr-xp >> $filename
# time
echo "* Timings" >> $filename
cat $me.timings >> $filename
line_table=$line_table$sep$temp_time
# Gather SR data from logs
echo -e '#id_sr\ttime_taken\tamount_processes' >> $output
grep 'Total time SR' $filename | awk '{print $7 "\x09" $9 "\x09" $10}' | tr -d ',' >> $output
$($cp_cmd $output $dest)
$($cp_cmd $filename $dest)
rm -rf $filename $output
rm -rf $me.timings
done
echo -e $line_table >> $file_table
done
$($cp_cmd $file_table $dest)
rm -rf $file_table
rm -rf tmp
#+end_src
** Baseline Performance
The benchmark can be run from this org-mode file, or simply by running
\texttt{./scripts/chord/testall.sh} inside the folder
\texttt{examples/msg/chord} of your SimGrid installation. Inside that script,
the number of threads to test, as well as the number of nodes, can be
modified.
The script generates a .csv table, but in case the benchmark is run in
several stages, the resulting logs can be processed with
\texttt{./scripts/chord/get\_times.py} (located in the same folder as
testall.sh). This generates a .csv file that can easily be plotted with
R/gnuplot.
The script is self-documented.
#+call: check_args[:session org-sh]()
#+call: go_to_chord[:session org-sh]()
#+call: testall[:session org-sh](log_folder='timings/logs')
** SR Distribution
To enable the Scheduling Round benchmarks, the constant \texttt{TIME\_BENCH\_ENTIRE\_SRS}
has to be defined; it can be defined in \texttt{src/simix/smx\_private.h}.
The logs give information about the time it takes to run a scheduling
round, as well as the number of processes each SR handles.
For this experiment, we are only interested in the number of processes
handled by each SR.
The script to run this experiment is
\texttt{./scripts/chord/testall\_sr.sh}. It gathers the ID of each SR,
the time of each SR and the number of processes per SR, and stores
them in table format.
#+call: check_args[:session org-sh]()
#+call: go_to_chord[:session org-sh]()
#+call: testall_sr[:session org-sh](log_folder='sr_counts/logs')
** SR Times
The data set used for this plot is the same as the one before.
We just use the data of the sequential simulations (1 thread).
** Binding threads to physical cores
The constant \texttt{CORE\_BINDING} has to be defined in
\texttt{include/xbt/xbt\_os\_thread.h} in order to enable this
optimization. The benchmark is then run in the same way as the
Baseline Performance experiment.
#+call: check_args[:session org-sh]()
#+call: go_to_chord[:session org-sh]()
#+call: testall[:session org-sh](log_folder='binding_cores/logs')
** parmap between N cores
This may be the experiment that requires the most work to reproduce:
*** maestro works with N-1 threads
This is the default setting and the standard benchmark can be used.
#+call: check_args[:session org-sh]()
#+call: go_to_chord[:session org-sh]()
#+call: testall[:session org-sh](log_folder='pmapM_N-1/logs')
*** maestro sleeps with N-1 threads
To prevent maestro from working alongside the threads, comment out the line
\texttt{xbt\_parmap\_work(parmap);}
in the function \texttt{xbt\_parmap\_apply()} in \texttt{src/xbt/parmap.c}.
#+call: check_args[:session org-sh]()
#+call: go_to_chord[:session org-sh]()
#+call: testall[:session org-sh](log_folder='pmap_N-1/logs')
*** maestro sleeps with N threads
To prevent maestro from working alongside the threads, comment out the line
\texttt{xbt\_parmap\_work(parmap);}
in the function \texttt{xbt\_parmap\_apply()} in \texttt{src/xbt/parmap.c}.
Then the function \texttt{src/xbt/parmap.c:xbt\_parmap\_new} has to be
modified to create one extra thread: just add 1 to the
\texttt{num\_workers} parameter.
#+call: check_args[:session org-sh]()
#+call: go_to_chord[:session org-sh]()
#+call: testall[:session org-sh](log_folder='pmap_N/logs')
** Busy Waiters vs. Futexes performance
Enable busy waiting by running chord with the extra option
\texttt{--cfg=contexts/synchro:busy\_wait}.
The experiment was run with testall.sh, using that extra option in the
chord command inside the script. The tables were built using \texttt{get\_times.py}.
The futex timings are the same as those gathered in the Baseline Performance experiment.
#+call: check_args[:session org-sh]()
#+call: go_to_chord[:session org-sh]()
#+call: testall[:session org-sh](log_folder='busy_waiters/logs')
** SR parallel threshold
The data set is the same as SR Distribution and SR times experiments.
** Adaptive threshold
The benchmark is done using testall.sh. The heuristic is the one
described in Section \ref{sec:adaptive}, and it can be enabled by defining the
constant \texttt{ADAPTIVE\_ALGORITHM} in \texttt{src/simix/smx\_private.h}.
#+call: check_args[:session org-sh]()
#+call: go_to_chord[:session org-sh]()
#+call: testall[:session org-sh](log_folder='adaptive-algorithm/logs')
* Data Analysis :noexport:
** Installing required packages
#+begin_src R :exports none
install.packages("ggplot2")
install.packages("gridExtra")
install.packages("reshape")
install.packages("plyr")
install.packages("data.table")
install.packages("stringr")
install.packages("grid")
#+end_src
** Libraries/Auxiliary functions
#+begin_src R :exports none
# If you miss the libraries, try typing >>>install.packages("data.table")<<< in a R console
library('ggplot2')
library('gridExtra')
library('reshape')
library('plyr')
library('data.table')
library('stringr')
require('grid')
# To plot several ggplot in one window.
vp.layout <- function(x, y) viewport(layout.pos.row=x, layout.pos.col=y)
arrange_ggplot2 <- function(..., nrow=NULL, ncol=NULL, as.table=FALSE) {
dots <- list(...)
n <- length(dots)
if(is.null(nrow) & is.null(ncol)){
nrow = floor(n/2) ; ncol = ceiling(n/nrow)
}
if(is.null(nrow)){
nrow = ceiling(n/ncol)
}
if(is.null(ncol)){
ncol = ceiling(n/nrow)
}
grid.newpage()
pushViewport(viewport(layout=grid.layout(nrow,ncol)))
ii.p <- 1
for(ii.row in seq(1, nrow)){
ii.table.row <- ii.row
if(as.table) {
ii.table.row <- nrow - ii.table.row + 1
}
for(ii.col in seq(1, ncol)){
ii.table <- ii.p
if(ii.p > n) break
print(dots[[ii.table]], vp=vp.layout(ii.table.row, ii.col))
ii.p <- ii.p + 1
}
}
}
# Get legend from a given plot
g_legend<-function(a.gplot){
tmp <- ggplot_gtable(ggplot_build(a.gplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend)
}
#+end_src
#+RESULTS:
** Pre-processing of datasets
The .csv files needed for almost all plots are created here, as well
as some R data sets that speed things up a little bit.
#+name: process_data_sr-times
#+begin_src R
temp = list.files(path='./logs/sr_counts/logs', pattern="sr_20000_threads1_precise.log", full.names = TRUE)
flist <- lapply(temp, read.table)
sr_data <- rbindlist(flist)
sr_data[, "V1"] <- NULL
sr_data = as.data.frame.matrix(sr_data)
saveRDS(sr_data, file="./logs/sr_counts/sr-times.Rda")
#+end_src
#+name: process_data_sr-par-threshold
#+begin_src R
#PRECISE MODE
#SEQUENTIAL
temp = list.files(path='./logs/sr_counts/logs', pattern="threads1_", full.names = TRUE)
temp <- temp[grepl("precise", temp)]
temp <- temp[grepl("25000", temp)]
#temp <- temp[-grep("50000", temp)]
#temp <- temp[-grep("75000", temp)]
flist <- lapply(temp, read.table)
sr_data <- rbindlist(flist)
#sr_data[, "V1"] <- NULL
sr_data = as.data.frame.matrix(sr_data)
#df <- ddply(sr_data, .(V3), summarize, mean_value = mean(V2))
#PARALLEL:
temp2 = list.files(path='./logs/sr_counts/logs', pattern="threads4_", full.names = TRUE)
temp2 <- temp2[grepl("precise", temp2)]
temp2 <- temp2[grepl("25000", temp2)]
flist2 <- lapply(temp2, read.table)
sr_data2 <- rbindlist(flist2)
#sr_data2[, "V1"] <- NULL
sr_data2 = as.data.frame.matrix(sr_data2)
#df2 <- ddply(sr_data2, .(V3), summarize, mean_value = mean(V2))
#CONSTANT MODE
#SEQUENTIAL
#temp3 = list.files(path='./logs/sr_counts/sequential', pattern="threads4_", full.names = TRUE)
#temp3 <- temp3[grepl("constant", temp3)]
#flist <- lapply(temp3, read.table)
#sr_data3 <- rbindlist(flist)
#sr_data3[, "V1"] <- NULL
#sr_data3 = as.data.frame.matrix(sr_data3)
#df3 <- ddply(sr_data3, .(V3), summarize, mean_value = mean(V2))
#PARALLEL:
#temp4 = list.files(path='./logs/sr_counts/parallel', pattern="threads4_", full.names = TRUE)
#temp4 <- temp4[grepl("constant", temp4)]
#temp4 <- temp4[-grep("50000", temp4)]
#temp4 <- temp4[-grep("75000", temp4)]
#flist2 <- lapply(temp4, read.table)
#sr_data4 <- rbindlist(flist2)
#sr_data4[, "V1"] <- NULL
#sr_data4 = as.data.frame.matrix(sr_data4)
#df4 <- ddply(sr_data4, .(V3), summarize, mean_value = mean(V2))
#Merge PRECISE datasets
df5 = merge(sr_data, sr_data2, by = 'V1', incomparables = NULL)
df5 <- transform(df5, speedup = V2.x / V2.y)
saveRDS(df5, file="./logs/sr_counts/precise.Rda")
#Merge CONSTANT datasets
#df6 = merge(sr_data3, sr_data4, by = 'V1', incomparables = NULL)
#df6 <- transform(df6, speedup = V2.x / V2.y)
#df6[, 'speedup'] <- df6[,'mean_value.x'] / df6[, 'mean_value.y']
#saveRDS(df6,file="./logs/sr_counts/constant.Rda")
#+end_src
#+name: see_percentages_sr-par-threshold
#+begin_src R
precise <- readRDS(file="./logs/sr_counts/logs/precise_10000.Rda")
under_500 <- precise[precise$V3.x<250,]
under_500 <- under_500[complete.cases(under_500),]
under_500 <- under_500[is.finite(rowSums(under_500)), ]
num_under_500 <- nrow(under_500)
# Calculate the percentage of SRs below the threshold above that showed a speedup.
a <- under_500[under_500$speedup > 1,]
n_speedup <- nrow(a)
b <- under_500[under_500$speedup <= 1,]
n_no_speedup <- nrow(b)
# Percentage of SRs below the threshold that did / did not show a speedup
perc_speedup <- (n_speedup * 100) / num_under_500
perc_no_speedup <- (n_no_speedup * 100) / num_under_500
#+end_src
# OPTIONAL: you may want to call this block first to make sure that THREADS and SIZES are the ones you want to plot.
#+call: set_python_args() :session
#+name: create_table
#+begin_src python :session :var elapsed=0 :var amdahl=0 :var memory=0 :var logs_path='"logs"' :var output_file='"logs/total_times.csv"' :results output
# This is a set of functions that generate .csv files with the times of the
# experiments. The memory consumption can be gathered as well. Note that the
# logs are the ones generated by the [[testall]] code chunk.
# Parameters:
#   elapsed: if set to True, the elapsed (wallclock) time is gathered.
#   amdahl: if set to True, the times of the Amdahl benchmark are gathered.
#   memory: if set to True, the peak RAM used by the process is gathered.
#   If none of them is set, then usrtime + systime is gathered.
#   logs_path: where the logs to analyze are stored.
#   output_file: where to store the produced table.
# If you run several tests of the same experiment, you can name the log files
# with a prefix ('1_chord..., 2_chord...') and then put the prefixes you used
# in input_seq. The script will average the corresponding values for you.
import datetime  # needed below to pretty-print the elapsed times
input_seq = ['']
def parse_elapsed_and_memory_used(file):
line = file.read().splitlines()
l = line[-1]
if l:
t = float((l.split()[0]).split(':')[1])
mem = float(((l.split()[6]).split(':')[1]).replace('k', ''))
mem = mem / (1024.0 * 1024.0) # gigabytes used
mem = float(("{0:.2f}".format(mem)))
return (t, mem)
else:
return (0, 0)
def parse_memory_used(file):
line = file.read().splitlines()
l = line[-1]
if l:
mem = float(((l.split()[6]).split(':')[1]).replace('k', ''))
mem = mem / (1024.0 * 1024.0) # gigabytes used
mem = float(("{0:.2f}".format(mem)))
return mem
else:
return 0
def parse_elapsed_real(file):
line = file.read().splitlines()[-1]
if line:
return float((line.split()[0]).split(':')[1])
else:
return 0
def parse_user_kernel(file):
    line = file.read().splitlines()[-1]
    if line:
        usrtime = float((line.split(":")[2]).split()[0])
        systime = float((line.split(":")[3]).split()[0])
        return usrtime + systime
    else:
        return 0
def parse_amdahl_times(file):
line = [line for line in file.read().splitlines() if "Amdahl" in line]
line = [(((l.split(";")[0]).split(":")[-1]).strip(),
((l.split(";")[1]).split(":")[1]).strip())
for l in line][0]
return float(line[0]) + float(line[1])
def print_header(file):
file.write('"nodes"')
for mode in MODES:
for thread in THREADS:
file.write(',"'+mode[0]+str(thread)+'"')
file.write('\n')
def parse_files(elapsed, amdahl, mem, logs_path, output_file):
f = open(output_file, "w")
print_header(f)
for size in SIZES:
temp_line = "{}".format(size)
for mode in MODES:
for thread in THREADS:
sum_l = 0.
mem_used = 0.
leng = len(input_seq)
for seq in input_seq:
file = open("{}/chord{}_{}_threads{}_{}.log".format(logs_path,
seq, size, thread, mode), "r")
if mem and elapsed:
tup = parse_elapsed_and_memory_used(file)
sum_l += tup[0]
mem_used += tup[1]
elif elapsed:
sum_l += parse_elapsed_real(file)
elif amdahl:
sum_l += parse_amdahl_times(file)
elif mem:
sum_l += parse_memory_used(file)
else:
sum_l += parse_user_kernel(file)
if leng != 0:
if mem and elapsed:
temp_line += ",{0},{1:.2f}".format(datetime.timedelta(seconds=int(sum_l / float(leng))),
(mem_used / float(leng)))
else:
temp_line += ",{}".format(sum_l / float(leng))
else:
if mem and elapsed:
temp_line += ",?,?"
else:
temp_line += ",?"
f.write(temp_line + "\n")
f.close()
parse_files(elapsed, amdahl, memory, logs_path, output_file)
#+end_src
#+call: create_table(0,1,0,'"logs/amdahl/logs"','"logs/amdahl/total_times_amdahl.csv"') :session
#+call: create_table(1,0,0,'"logs/timings/logs"','"logs/timings/total_times.csv"') :session
#+call: create_table(0,0,1,'"logs/timings/logs"','"logs/timings/memory_consumption.csv"') :session
# Call this to change the amount of threads: in the next 2 tables, we don't include the serial benchmarks.
#+begin_src python :session
# We only test performance improvements in parallel executions with
# adaptive algorithm and busy_waiters.
THREADS = [2, 4, 8, 16, 24]
#+end_src
#+call: create_table(1,0,0,'"logs/busy_waiters/logs"','"logs/busy_waiters/total_times_busy.csv"') :session
#+call: create_table(1,0,0,'"logs/adaptive_algorithm/logs"','"logs/adaptive_algorithm/total_times_adaptive.csv"') :session
# OPTIONAL: This csv is useful for the table of Section 5.2
#+call: create_table(1,0,1,'"logs/timings/logs"','"logs/timings/total_times_memory_adaptive.csv"') :session
** Plotting
#+name: baseline_perf
#+begin_src R :results output graphics :exports results :file fig/baseline-perf.pdf
data = read.csv("./logs/timings/total_times.csv", head=TRUE, sep=',')
# Speedups of Precise Mode
data[, "baseline"] <- data[, "p1"] / data[, "p1"]
data[, "2"] <- data[, "p1"] / data[, "p2"]
data[, "4"] <- data[, "p1"] / data[, "p4"]
data[, "8"] <- data[, "p1"] / data[, "p8"]
data[, "16"] <- data[,"p1"] / data[, "p16"]
data[, "24"] <- data[,"p1"] / data[, "p24"]
keep <- c("nodes", colnames(data)[grep("^[1-9]", colnames(data))], "baseline")
speedup_precise <- data[keep]
df2 <- melt(speedup_precise , id = 'nodes', variable_name = 'threads')
g2<-ggplot(df2, aes(x=nodes,y=value, group=threads, colour=threads)) + geom_line() +
theme(axis.text.x = element_text(angle = -45, hjust = 0),
panel.grid.major=element_line(colour='grey'),panel.grid.minor=element_blank(),
panel.background = element_blank(), axis.line=element_line(),
legend.position="right") + xlab("Amount of nodes simulated") + ylab("Speedup-Precise mode") +
scale_fill_discrete(name="threads") + scale_x_continuous(breaks=df2$nodes)
g2
#+end_src
#+name: sr-distribution
#+begin_src R :results output graphics :exports results :file fig/sr-distribution.pdf
temp = list.files(path='./logs/sr_counts/logs', pattern="threads4", full.names = TRUE)
temp <- temp[grep("precise",temp)]
# This data.frame will store the final proportion values.
#proportions <- data.frame(stringsAsFactors=FALSE)
proportions <- data.frame(row.names = c('1','2','3-5','6-10','11-20','21-30','31+'))
head <- c()
for(i in temp){
col <- c()
# Parse amount of nodes from the file path.
# Example of file path: './logs/sr_counts/parallel/sr_10000_threads4_constant.log'
nodes = strsplit(str_extract(i, "_[0-9]+_"), "_")[[1]][2]
head <- c(head,as.numeric(nodes))
col <- c(col, nodes)
# Keep only the column with the amount of processes
data <- read.table(i)["V3"]
# Calculate proportions
data <- prop.table(xtabs(~ V3, data=data))
# Populate a new data frame with percentages of interest (1, 2, 3 or more processes)
proc1 <- data["1"][[1]]
proc2 <- data["2"][[1]]
proc3_5 <- c(data["3"][[1]],data["4"][[1]], data["5"][[1]])
proc6_10 <- c(data["6"][[1]], data["7"][[1]], data["8"][[1]], data["9"][[1]], data["10"][[1]])
proc11_20 <- c(data["11"][[1]], data["12"][[1]], data["13"][[1]], data["14"][[1]], data["15"][[1]], data["16"][[1]], data["17"][[1]], data["18"][[1]], data["19"][[1]], data["20"][[1]])
proc21_30 <- c(data["21"][[1]], data["22"][[1]], data["23"][[1]], data["24"][[1]], data["25"][[1]], data["26"][[1]], data["27"][[1]], data["28"][[1]], data["29"][[1]], data["30"][[1]])
# Calculate final percentages and omit any possible NA
proc3_5 <- Reduce("+", proc3_5[!is.na(proc3_5)])
proc6_10 <- Reduce("+", proc6_10[!is.na(proc6_10)])
proc11_20 <- Reduce("+", proc11_20[!is.na(proc11_20)])
proc21_30 <- Reduce("+", proc21_30[!is.na(proc21_30)])
proc31 <- 1 - (proc1 + proc2 + proc3_5 + proc6_10 + proc11_20 + proc21_30)
#p <- c(nodes, proc1, proc2, proc3_5, proc6_10, proc11_20, proc21_30, proc31)
# And bind to existing data.frame
#p <- as.data.frame(p)
#p[,'nodes'] <- nodes
#p[,'process'] <- c("1","2",">3")
proportions <- cbind(proportions, nodes = c(proc1, proc2, proc3_5, proc6_10, proc11_20, proc21_30, proc31))
colnames(proportions)[length(proportions)] <- as.numeric(nodes)
}
head <- sort(head)
cols <- c()
for(e in head){ cols <- c(cols,toString(e))}
proportions <- proportions[,cols]
b <- barplot(as.matrix(proportions), ylab="Proportion of SR's having different number of processes",
legend=rownames(proportions), args.legend = list(x = ncol(proportions) + 5.5, bty = "n"),
xlim=c(0, ncol(proportions) + 4), las=2, cex.axis = 0.8)
title(xlab = "Amount of nodes simulated", line=4)
#df <- ddply(proportions, .(nodes,process), summarise, msteps = mean(p))
#g<-ggplot(df, aes(x=nodes, y=msteps, group=process, colour=process)) + geom_line() +
# theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
# panel.background = element_blank(), axis.line=element_line()) +
# scale_fill_discrete(name="threads") +
# xlab("Amount of nodes simulated") + ylab("Percentage of SR's containing 1,2 or >3 processes")
#g
#+end_src
#+name: sr-times
#+begin_src R :results output graphics :exports results :file fig/sr-times.pdf
sr_data <- readRDS(file="./logs/sr_counts/sr-times.Rda")
#df <- ddply(sr_data, .(V3), summarize, mean_value = mean(V2))
# To plot the mean per SR size instead of every dot, uncomment the ddply line
# above and replace V2 by 'mean_value' below.
ggplot(data=sr_data, aes(x=V3, y=V2)) + xlim(0,4000) + ylim(0,0.02) +
xlab("Number of processes computed in SR's") + ylab("Average time consumed (seconds)") + geom_point(size = 1) +
theme(panel.grid.major=element_line(colour='grey'),panel.grid.minor=element_blank(),
panel.background = element_blank(), axis.line=element_line(),
legend.position="none")
#+end_src
#+name: busy
#+begin_src R :results output graphics :exports results :file fig/busy.pdf
orig_data = read.csv("./logs/busy_waiters/total_times_orig.csv", head=TRUE, sep=',')
opt_data = read.csv("./logs/busy_waiters/total_times_busy.csv", head=TRUE, sep=',')
# Speedups of Precise Mode
opt_data[, "baseline"] <- orig_data[, "p2"] / orig_data[, "p2"]
opt_data[, "2"] <- orig_data[, "p2"] / opt_data[, "p2"]
opt_data[, "4"] <- orig_data[, "p4"] / opt_data[, "p4"]
opt_data[, "8"] <- orig_data[, "p8"] / opt_data[, "p8"]
opt_data[, "16"] <- orig_data[, "p16"] / opt_data[, "p16"]
opt_data[, "24"] <- orig_data[, "p24"] / opt_data[, "p24"]
keep <- c("nodes", colnames(opt_data)[grep("^[1-9]", colnames(opt_data))], "baseline")
speedup_precise <- opt_data[keep]
df2 <- melt(speedup_precise , id = 'nodes', variable_name = 'threads')
g2<-ggplot(df2, aes(x=nodes,y=value, group=threads, colour=threads)) + geom_line() +
theme(axis.text.x = element_text(angle = -45, hjust = 0),
panel.grid.major=element_line(colour='grey'),panel.grid.minor=element_blank(),
panel.background = element_blank(), axis.line=element_line(),
legend.position="right") + ylab("Speedup") +
xlab("Amount of nodes simulated") + scale_x_continuous(breaks=df2$nodes)
g2
#+end_src
#+name: sr-par-threshold
#+begin_src R :results output graphics :exports results :file fig/sr-par-threshold_40000.png
precise <- readRDS(file="./logs/sr_counts/logs/precise_40000.Rda")
ggplot(data=precise, aes(x=V3.x, y=speedup)) + geom_point() +
xlim(1,500) +ylim(0,2) +
theme(panel.grid.major=element_line(colour='grey'),panel.grid.minor=element_blank(),
panel.background = element_blank(), axis.line=element_line(),
legend.position="none") +
ylab("Speedup of parallel execution against sequential execution") +
xlab("Amount of processes computed by each SR")
#+end_src
#+name: adapt-algorithm
#+begin_src R :results output graphics :exports results :file fig/adapt-algorithm.pdf
orig_data = read.csv("./logs/adaptive_algorithm/total_times_orig.csv")
opt_data = read.csv("./logs/adaptive_algorithm/total_times_adaptive.csv")
# Speedups of Precise Mode
opt_data[, "baseline"] <- orig_data[, "p2"] / orig_data[, "p2"]
opt_data[, "2"] <- orig_data[, "p2"] / opt_data[, "p2"]
opt_data[, "4"] <- orig_data[, "p4"] / opt_data[, "p4"]
opt_data[, "8"] <- orig_data[, "p8"] / opt_data[, "p8"]
opt_data[, "16"] <- orig_data[, "p16"] / opt_data[, "p16"]
opt_data[, "24"] <- orig_data[, "p24"] / opt_data[, "p24"]
keep <- c("nodes", colnames(opt_data)[grep("^[1-9]+", colnames(opt_data))], "baseline")
speedup_precise <- opt_data[keep]
df2 <- melt(speedup_precise , id = 'nodes', variable_name = 'threads')
g2<-ggplot(df2, aes(x=nodes,y=value, group=threads, colour=threads)) + geom_line() + scale_fill_hue() + theme(axis.text.x = element_text(angle = -45, hjust = 0),
panel.grid.major=element_line(colour='grey'),panel.grid.minor=element_blank(),
panel.background = element_blank(), axis.line=element_line(),
legend.position="right") +
scale_y_continuous(breaks=c(1,2))+ ylab("Speedup-Precise mode") +
xlab("Amount of nodes simulated")
g2
#+end_src
#+name: memory-consumption
#+begin_src R :results output graphics :exports results :file fig/memory-consumption.pdf
data = read.csv("./logs/timings/memory_consumption.csv", head=TRUE, sep=',')
df2 <- melt(data, id = 'nodes', variable_name = 'threads')
g2<-ggplot(df2, aes(x=nodes,y=value, group=threads, colour=threads)) + geom_line() + theme(axis.text.x = element_text(angle = -45, hjust = 0),
panel.grid.major=element_line(colour='grey'),panel.grid.minor=element_blank(),
panel.background = element_blank(), axis.line=element_line(),
legend.position="right") + xlab("Amount of nodes simulated") + ylab("Memory Consumption (GB)") +
scale_fill_discrete(name="threads") + scale_color_manual(values=c('brown1','darkblue','darkorange2','cadetblue2','gold','hotpink4'),labels = c("1","2","4","8","16","24"))
g2
#+end_src
#+name: real-elapsed-times
#+begin_src R :results output graphics :exports results :file fig/real-elapsed-times.pdf
data = read.csv("./logs/timings/total_times.csv", head=TRUE, sep=',')
df2 <- melt(data, id = 'nodes', variable_name = 'threads')
g2<-ggplot(df2, aes(x=nodes,y=value, group=threads, colour=threads)) + geom_line() + theme(axis.text.x = element_text(angle = -45, hjust = 0),
panel.grid.major=element_line(colour='grey'),panel.grid.minor=element_blank(),
panel.background = element_blank(), axis.line=element_line(),
legend.position="right") + xlab("Amount of nodes simulated") + ylab("Elapsed time of simulation (seconds)") +
scale_fill_discrete(name="threads") + scale_color_manual(values=c('brown1','darkblue','darkorange2','cadetblue2','gold','hotpink4'),labels = c("1","2","4","8","16","24"))
g2
#+end_src
#+name: binding
#+begin_src R :results output graphics :exports results :file fig/binding.pdf
orig_data = read.csv("./logs/binding_cores/total_times_orig.csv", head=TRUE, sep=',')
opt_data = read.csv("./logs/binding_cores/total_times_binding.csv", head=TRUE, sep=',')
# Speedups of Precise Mode
opt_data[, "baseline"] <- orig_data[, "p2"] / orig_data[, "p2"]
opt_data[, "2"] <- orig_data[, "p2"] / opt_data[, "p2"]
opt_data[, "4"] <- orig_data[, "p4"] / opt_data[, "p4"]
opt_data[, "8"] <- orig_data[, "p8"] / opt_data[, "p8"]
opt_data[, "16"] <- orig_data[, "p16"] / opt_data[, "p16"]
opt_data[, "24"] <- orig_data[, "p24"] / opt_data[, "p24"]
keep <- c("nodes", colnames(opt_data)[grep("^[1-9]", colnames(opt_data))], "baseline")
speedup_precise <- opt_data[keep]
df2 <- melt(speedup_precise , id = 'nodes', variable_name = 'threads')
g2<-ggplot(df2, aes(x=nodes,y=value, group=threads, colour=threads)) + geom_line() +
scale_fill_hue() + theme(axis.text.x = element_text(angle = -45, hjust = 0),
panel.grid.major=element_line(colour='grey'),panel.grid.minor=element_blank(),
panel.background = element_blank(), axis.line=element_line(),
legend.position="right") +
scale_y_continuous(breaks=c(1.0,1.5,2.0,2.5,3.0,4.0)) + ylab("Speedup") +
xlab("Amount of nodes simulated")
g2
#+end_src
#+name: evol-threshold
#+begin_src R :results output graphics :exports results :file fig/evol-threshold.pdf
data = read.table("./logs/threshold/logs/thresh2_10000_threads4_precise.log", head=TRUE, sep=',')
ggplot(data, aes(x=seq_len(nrow(data)), y=X2)) + geom_line() +
theme(axis.text.x = element_text(angle = -45, hjust = 0),
panel.grid.major=element_line(colour='grey'),panel.grid.minor=element_blank(),
panel.background = element_blank(), axis.line=element_line(),
legend.position="right") + xlab("Id of Scheduling Round") + ylab("Value of parallel threshold")
#+end_src
#+name: all-enabled
#+begin_src R :results output graphics :exports results :file fig/all-enabled.pdf
orig_data = read.csv("./logs/all_optimized/total_times_orig.csv", head=TRUE, sep=',')
opt_data = read.csv("./logs/all_optimized/total_times_all.csv", head=TRUE, sep=',')
# Speedups of Precise Mode
opt_data[, "baseline"] <- orig_data[, "p1"] / orig_data[, "p1"]
opt_data[, "2"] <- orig_data[, "p1"] / opt_data[, "p2"]
#opt_data[, "4"] <- orig_data[, "p1"] / opt_data[, "p4"]
#opt_data[, "8"] <- orig_data[, "p1"] / opt_data[, "p8"]
#opt_data[, "16"] <- orig_data[, "p1"] / opt_data[, "p16"]
#opt_data[, "24"] <- orig_data[, "p1"] / opt_data[, "p24"]
keep <- c("nodes", colnames(opt_data)[grep("^[1-9]", colnames(opt_data))], "baseline")
speedup_precise <- opt_data[keep]
df2 <- melt(speedup_precise , id = 'nodes', variable_name = 'threads')
g2<-ggplot(df2, aes(x=nodes,y=value, group=threads, colour=threads)) + geom_line() +
scale_fill_hue() + theme(axis.text.x = element_text(angle = -45, hjust = 0),
panel.grid.major=element_line(colour='grey'),panel.grid.minor=element_blank(),
panel.background = element_blank(), axis.line=element_line(),
legend.position="right") + ylab("Speedup") +
xlab("Amount of nodes simulated")
g2
#+end_src
#+RESULTS: all-enabled
[[file:fig/all-enabled.pdf]]
* Emacs Setup :noexport:
This document has local variables in its postamble, which should
allow org-mode to work seamlessly without any setup. If you are
uncomfortable with such variables, you can safely ignore them at
startup. Exporting may require copying them into your .emacs.
# Local Variables:
# eval: (org-babel-do-load-languages 'org-babel-load-languages '( (sh . t) (R . t) (perl . t) (ditaa . t) ))
# eval: (setq org-confirm-babel-evaluate nil)
# eval: (setq org-alphabetical-lists t)
# eval: (setq org-src-fontify-natively t)
# eval: (unless (boundp 'org-latex-classes) (setq org-latex-classes nil))
# eval: (add-to-list 'org-latex-classes
# '("sigalt" "\\documentclass{sig-alternate}" ("\\section{%s}" . "\\section*{%s}") ("\\subsection{%s}" . "\\subsection*{%s}")))
# eval: (add-hook 'org-babel-after-execute-hook 'org-display-inline-images)
# eval: (add-hook 'org-mode-hook 'org-display-inline-images)
# eval: (add-hook 'org-mode-hook 'org-babel-result-hide-all)
# eval: (setq org-babel-default-header-args:R '((:session . "org-R")))
# eval: (setq org-export-babel-evaluate nil)
# eval: (setq org-latex-to-pdf-process '("pdflatex -interaction nonstopmode -output-directory %o %f ; bibtex `basename %f | sed 's/\.tex//'` ; pdflatex -interaction nonstopmode -output-directory %o %f ; pdflatex -interaction nonstopmode -output-directory %o %f"))
# eval: (setq ispell-local-dictionary "american")
# eval: (setq org-export-latex-table-caption-above nil)
# eval: (eval (flyspell-mode t))
# End: