#+TITLE: Parallel and Distributed Simulation of Large-Scale Distributed Applications
#+AUTHOR: Ezequiel Torti Lopez, Martin Quinson
#+EMAIL: ret0110@famaf.unc.edu.ar, martin.quinson@loria.fr
#+STARTUP: indent hideblocks
#+TAGS: noexport(n)
#+EXPORT_SELECT_TAGS: export
#+EXPORT_EXCLUDE_TAGS: noexport
#+PROPERTY: session *R*
#+LATEX_class: sigalt
#+LATEX_HEADER: \usepackage[T1]{fontenc}
#+LATEX_HEADER: \usepackage[utf8]{inputenc}
#+LATEX_HEADER: \usepackage{ifthen,figlatex}
#+LATEX_HEADER: \usepackage{longtable}
#+LATEX_HEADER: \usepackage{float}
#+LATEX_HEADER: \usepackage{wrapfig}
#+LATEX_HEADER: \usepackage{subfigure}
#+LATEX_HEADER: \usepackage{xspace}
#+LATEX_HEADER: \usepackage[american]{babel}
#+LATEX_HEADER: \usepackage{url}\urlstyle{sf}
#+LATEX_HEADER: \usepackage{amscd}
#+LATEX_HEADER: \usepackage{algorithm}
#+LATEX_HEADER: \usepackage[noend]{algpseudocode}
#+LATEX_HEADER: \renewcommand{\algorithmiccomment}[1]{// #1}
#+LATEX_HEADER: \makeatletter
#+LATEX_HEADER: \addto\captionsenglish{\renewcommand{\ALG@name}{Heuristic}}
#+LATEX_HEADER: \makeatother
#+LATEX_HEADER: \usepackage{caption}
#+LATEX_HEADER: \DeclareCaptionLabelFormat{alglabel}{\bfseries\csname ALG@name\endcsname:}
#+LATEX_HEADER: \captionsetup[algorithm]{labelformat=alglabel}

* Motivation and Problem Statement
Simulation is the third pillar of science, allowing the study of complicated phenomena through complex models. When the size or complexity of the studied models becomes too large, it is classical to leverage more resources through Parallel Discrete-Event Simulation (PDES). Still, the parallel simulation of very fine-grained applications deployed on large-scale distributed systems (LSDS) remains challenging. As a matter of fact, most simulators of Peer-to-Peer systems are sequential, despite the vast literature on PDES over the last three decades. dPeerSim is one of the very few existing PDES for P2P systems, but its performance is disappointing: it achieves a decent speedup when increasing the number of logical processes (LPs), from 4h with 2 LPs down to 1h with 16 LPs, but it remains vastly inefficient when compared to the sequential version of PeerSim, which performs the same experiment in only 50 seconds. This calls for a new parallel schema specifically tailored to this category of Discrete-Event Simulators.

Discrete-Event Simulation of distributed applications classically alternates between simulation phases, where the models compute the next event date, and phases where the application workload is executed. We proposed in~\cite{previous} not to split the simulation model across several computing nodes, but instead to keep the model sequential and execute the application workload in parallel when possible. We hypothesized that this would help reduce the synchronization costs. We evaluate our contribution with very fine-grained workloads such as P2P protocols. These workloads are the most difficult to execute efficiently in parallel because execution times are very short, making it very difficult to amortize the synchronization times. We implemented this parallel schema within the SimGrid framework \cite{simgrid}, and showed that the extra complexity does not endanger the performance: the sequential version of SimGrid still outperforms several competing solutions when our additions are present but disabled at run time. To the best of our knowledge, it is the first time that a parallel simulation of a P2P system proves to be faster than the best known sequential execution.
Yet, the parallel simulation only outperforms the sequential one when the number of processes becomes large enough. This is a consequence of the pigeonhole principle: when the number of processes increases, the average number of processes that are ready to run at each simulated timestamp (and can thus run in parallel) also increases. When simulating the Chord protocol, it takes 500,000 processes or more to amortize the synchronization costs, while the classical studies of the literature usually involve fewer processes. The current work aims at further improving the performance of our PDES, using several P2P protocols as a workload. We investigate the possible inefficiencies and propose generic solutions that could be included in other similar simulators of large-scale distributed systems, be they P2P, cloud, HPC or sensor network simulators.

This paper is organized as follows: Section \ref{sec:context} recaps the SimGrid architecture and quickly presents the parallel execution schema detailed in \cite{previous}. Section \ref{sec:problem} analyzes the current performance of our model. Section \ref{sec:parallel} explores several trade-offs for the efficiency of the parallel sections. Section \ref{sec:adaptive} proposes an algorithm to automatically tune the level of parallelism so that it is adapted to the simulated application. Section \ref{sec:cc} concludes this paper and discusses some future work.
# the theoretical performance bound, and discusses the previous work in the light of Amdahl's law

* Context
#+LaTeX: \label{sec:context}
In the previous work \cite{previous} we proposed to parallelize the execution of the user code while keeping the simulation engine sequential. This is enabled by applying classical concepts of OS design to this new context: every interaction between the user processes (from now on, user processes and processes mean the same thing) and the simulated environment passes through a specific layer that acts as an OS kernel. A novel way to virtualize user processes (\emph{raw contexts}) was crafted to improve efficiency and avoid unnecessary system calls, but other mechanisms remain available for the sake of portability, such as full-featured threads or POSIX \emph{ucontexts}. A new data structure to store the shared state of the system and synchronize the process execution was implemented as well (\emph{parmap}). The new layer acting as the OS kernel was implemented in SimGrid to emulate system calls, called \emph{requests}. Each time a process wants to interact with another process, or with the engine itself, it raises a \emph{request}. At the end of what is called a 'Scheduling Round' (SR), all the active processes have raised their request and wait for a response, or have finished their work. Then the engine takes control of the program and answers the \emph{requests} of each process sequentially. This way the user processes can be parallelized in a safe manner; a simplified sketch of one scheduling round is given below. More details on the execution model can be found in \cite{previous} and on the official website of SimGrid \cite{simgrid}.

Experimental results showed that the new design does not hinder the tool's scalability: even the sequential version is more scalable than state-of-the-art simulators. The difficulty of getting a parallel version of a P2P simulator faster than its sequential counterpart was also revealed in \cite{previous}, this being the first time that a parallel simulation of Chord ran faster than the best known sequential implementation.
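To make this execution model concrete, the following fragment sketches the overall shape of one scheduling round. It is a simplified illustration under our own assumptions, not SimGrid's actual code: the type and helper names (\texttt{process\_t}, \texttt{run\_user\_code\_in\_parallel()}, \texttt{answer\_request()}, ...) are hypothetical.

#+begin_src C
/* Sketch of one Scheduling Round (SR): run every ready user process
 * until it has raised a request or terminated, then let the sequential
 * engine (maestro) answer the collected requests one by one. */
typedef struct {
  int terminated;       /* did the process finish its work?          */
  int pending_request;  /* emulated system call raised during the SR */
} process_t;

/* These helpers stand for SimGrid internals and are only declared here. */
void run_user_code_in_parallel(process_t *ready[], int nb_ready);
void run_user_code_sequentially(process_t *ready[], int nb_ready);
void answer_request(int request);

static int parallel_threshold; /* minimum number of processes to go parallel */

void scheduling_round(process_t *ready[], int nb_ready)
{
  /* 1. Execute the user code of all ready processes; each one stops as
   *    soon as it raises a request (its emulated system call).        */
  if (nb_ready >= parallel_threshold)
    run_user_code_in_parallel(ready, nb_ready);    /* worker threads */
  else
    run_user_code_sequentially(ready, nb_ready);

  /* 2. Maestro answers the requests sequentially, which keeps the
   *    simulation models free of any concurrent access.              */
  for (int i = 0; i < nb_ready; i++)
    if (!ready[i]->terminated)
      answer_request(ready[i]->pending_request);
}
#+end_src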
Another interesting result shown in the previous work is that the speedups only increase up to a certain point when increasing the number of working threads. We also showed that for small instances parallelism actually hinders the performance, and that the relative gain of parallelism seems to be strictly increasing with the system size. We are now close to the speedup bound given by Amdahl's law ($S(p) = 1/((1-f)+f/p)$ for a parallelizable fraction $f$ of the execution and $p$ threads), which means that we have reached a limit on the parallelizable portion of the code in our proposed model. The remaining optimizations thus aim at a final speedup: tuning the parallel threshold dynamically depending on the simulation, and improving the performance of the threads by taking into account their distribution over the CPU cores and the different synchronization modes (futexes, POSIX primitives or busy waiting). All the experiments were run using the facilities provided by Grid'5000 \cite{g5k}.

* Performance Analysis
#+LaTeX: \label{sec:problem}
** Current speedup achieved
# Also, the non-intrusiveness of the benchmarking is discussed here.
To get baseline timings and a speedup plot for the development version of SimGrid (3.12), we benchmarked the execution time in Precise mode with different numbers of threads (1, 2, 4, 8, 16 and 24). For this we used an implementation of the well-known Chord protocol \cite{chord} as workload. The absolute times of a normal execution of the Chord simulation are presented in Table \ref{tab:one}.

#+caption: Execution times of a normal execution of Chord with different sizes, serial and with 2 and 8 threads. The average memory consumption is reported in GB.
#+name: tab:one
|---+-------+---------+-------+---------+-------+---------+-------|
|   | nodes | serial  | Mem   | 2 thr.  | Mem.  | 8 thr.  | Mem.  |
| / | <>    | <       | >     | <       | >     | <       | >     |
|---+-------+---------+-------+---------+-------+---------+-------|
| # | 10k   | 0:01:03 |  0.25 | 0:01:20 |  0.26 | 0:01:35 |  0.25 |
| # | 50k   | 0:06:20 |  1.24 | 0:07:39 |  1.27 | 0:08:03 |  1.25 |
| # | 100k  | 0:13:34 |  2.47 | 0:15:36 |  2.53 | 0:15:50 |  2.50 |
| # | 300k  | 0:50:58 |  7.38 | 0:55:18 |  7.54 | 0:57:55 |  7.47 |
| # | 500k  | 1:38:16 | 12.30 | 1:34:15 | 12.47 | 1:35:10 | 12.45 |
| # | 1m    | 4:05:41 | 24.53 | 4:00:42 | 24.89 | 3:47:28 | 24.91 |
|---+-------+---------+-------+---------+-------+---------+-------|

As can be seen in Figure \ref{fig:one.one}, the memory consumption increases linearly with the number of simulated nodes: each node uses between 25 KB and 30 KB of memory. A simulation with 1,000 nodes has a peak memory consumption of around 30 MB (regardless of the number of threads launched) and finishes in 4 seconds in a serial execution, while one with 1,000,000 nodes takes 24-25 GB of memory and 3h47m to finish in the best case (parallel execution with 8 threads).

#+attr_latex: width=0.8\textwidth,placement=[p]
#+label: fig:one.one
#+caption: Memory consumption reported in GB
[[file:fig/memory-consumption.pdf]]

The actual speedup obtained can be seen in Figure \ref{fig:one}. It is clear from that graph that a real speedup with our parallel model is only obtained when the size of the problem exceeds 300,000 nodes. This confirms what was shown in \cite{previous}. Figure \ref{fig:one} also shows that increasing the number of threads may not be the best option to increase performance, since the best speedups are achieved with 2, 4 and 8 threads.
Some of the optimizations proposed in Section \ref{sec:parallel} show improvements over the original versions with 16 and 24 threads, but their total times are still behind those of the same simulations with a smaller number of threads.

#+attr_latex: width=0.8\textwidth,placement=[p]
#+label: fig:one
#+caption: Baseline performance of SimGrid 3.12. Speedups achieved using multithreaded executions against the sequential ones.
[[file:fig/baseline-perf.pdf]]

# We want to see now is how far are we from the ideal speedup that
# would be achieved according to the Amdahl law. For that, a benchmark
# test is run to get the timings of the sequential and parallel parts of
# the executions, and the calculate that speedup using the Amdahl
# equation.
# But first we want to prove that our benchmarks are not intrusive, that
# is, our measures of parallel and sequential times do not really affect
# the overall performance of the system. For that, the experiments are
# run with and without benchmarking, using the Precise mode, and then a
# comparison of both is made to find if there is a significative breach
# in the timings of both experiments.
# Using the Chord simulation, the experiment showed us that the maximum
# difference in the execution time of both versions is lesser than 10%
# in most of the cases, and is even lower with sizes bigger than 100000
# nodes, which allow us to conclude the benchmarking is, indeed, not
# intrusive.

** Parallelizable portions of the problem
We want to analyze each SR and find any possible performance problem there, since this is the portion of code that is run in parallel in our model. Using the same Chord implementation as workload, we gather the following data: the ID of each Scheduling Round, the time taken by each Scheduling Round, and the number of processes executed in each Scheduling Round.

As can be seen in Figure \ref{fig:two}, the proportion of SRs having just one process varies between 26\% and 48\% (the larger the simulated size, the lower the proportion of SRs that have only one process), while the others involve two or more processes. These remaining SRs are executed in parallel according to the parallel execution threshold already set in SimGrid (which can be modified through a parameter).

#+attr_latex: width=0.8\textwidth,placement=[p]
#+label: fig:two
#+caption: Proportions of SRs having different numbers of processes to compute, according to the number of nodes simulated.
[[file:fig/sr-distribution.pdf]]

However, launching a small number of processes in parallel is inefficient due to the synchronization costs of the threads. Even though Figure \ref{fig:three} shows that the larger the number of processes in an SR, the longer its execution time, no speedup is obtained from executing small numbers of processes in parallel, as we will see in Section \ref{sec:adaptive}. Hence, it would be convenient to know, during a simulation, when to launch an SR in parallel and when to do it sequentially. A heuristic to accomplish that is proposed later in this document.

#+attr_latex: width=0.8\textwidth,placement=[p]
#+label: fig:three
#+caption: Average times of sequential executions of SRs, depending on the number of processes of each SR.
[[file:fig/sr-times.pdf]]

* Optimizations
#+LaTeX: \label{sec:parallel}
** Binding threads to physical cores
On multicore architectures (i.e., almost every modern CPU), parallelization through threads is well known to be a good choice if done correctly.
This approach, currently used by SimGrid, showed a good gain in speed with bigger sizes, as discussed in Section \ref{sec:problem}. But there are still improvements that might reduce the noise and the overhead that inherently come with threads. Thread execution depends heavily on the operating system scheduler: when one thread is \emph{idle}, the scheduler may decide to switch it for another thread that is ready to work, so as to maximize the occupancy of the CPU cores and, hopefully, run the program faster. Or it may simply switch threads because their time quantum is over. When the first thread is ready to work again, the CPU core where it was running might be occupied, forcing the system to run the thread on another core. Regardless of the situation, or of the scheduler in use, the general problem remains: frequent CPU migrations of threads can be detrimental to performance.

In order to avoid these CPU migrations produced by the constant context switching of threads, the GNU C library \cite{glibc} offers a way to bind each thread to a physical core of the CPU. Note that this is only available on Linux platforms. A Chord simulation was run on a parapluie node with 24 cores, binding the threads to physical cores. The number of CPU migrations was drastically reduced (almost 97\% fewer migrations) in all cases, but the relative speedup was not significant: always lower than 1.5x, regardless of the number of threads or the simulated size. However, the larger speedups were obtained with sizes below 100,000 nodes, which allows us to conclude that CPU migrations should be avoided when the simulation is small enough, since they introduce an unwanted overhead.

** Parmap between N cores
Several optimizations regarding the distribution of work between threads were proposed: the first option is the default one, where maestro works along with its threads and the processes are distributed evenly among the threads; the second one is to put maestro to sleep and let the worker threads do all the computing; the last one involves creating one extra thread and making all these N threads work while maestro sleeps. The experiments showed that no performance gain was achieved. In fact, the creation of one extra thread proved to be slower than the original version of parmap, while putting maestro to sleep and letting its N-1 threads do the computation did not show any improvement or loss in performance.

** Busy Waiting versus Futexes
SimGrid provides several types of synchronization between threads: Fast Userspace Mutexes (futexes), the classical POSIX synchronization primitives, and busy waiting. While each of them can be chosen when running the simulation, futexes are the default option, since they have the advantage of implementing a fast synchronization mode within the parmap abstraction, mostly in user space. But even if they are more efficient than classical mutexes (which involve the kernel), they may present the performance drawbacks that inherently come with synchronization costs. In this section we compare the performance of busy waiting and futexes, using the Chord example; the difference between the two waiting strategies is sketched below. As can be seen in Figure \ref{fig:four}, the gain in speed is immediate with small sizes: the elimination of any synchronization call makes the simulation run up to 2 times faster. However, with bigger sizes the performance drops and matches the one achieved with futexes.

#+attr_latex: width=0.8\textwidth,placement=[p]
#+label: fig:four
#+caption: Relative speedup of busy waiters vs. futexes in Chord simulation.
[[file:fig/busy.pdf]]
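As an illustration of the two waiting strategies compared above, the fragment below contrasts a busy-wait loop with a futex-based wait. This is a minimal sketch and not SimGrid's parmap code; the \texttt{flag} variable and function names are ours.

#+begin_src C
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int flag = 0;              /* 0 = not ready, 1 = ready */

/* Busy waiting: spin in user space until the flag changes.
 * No system call is made, but a CPU core stays 100% busy.  */
static void wait_busy(void)
{
  while (atomic_load(&flag) == 0)
    ;  /* spin */
}

/* Futex-based waiting: block in the kernel while the value is still 0.
 * The check stays in user space; the system call is only paid when the
 * thread actually has to sleep.                                        */
static void wait_futex(void)
{
  while (atomic_load(&flag) == 0)
    syscall(SYS_futex, &flag, FUTEX_WAIT, 0, NULL, NULL, 0);
}

/* Release the waiters once the flag has been set. */
static void signal_ready(void)
{
  atomic_store(&flag, 1);
  syscall(SYS_futex, &flag, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}
#+end_src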
* Optimal threshold for parallel execution
#+LaTeX: \label{sec:adaptive}
** Getting a real threshold over simulations
The threshold we are looking for is the number of processes above which it pays off to execute an SR in parallel, and below which it is better to execute it sequentially. Initially, we want to find a good threshold for the beginning of any simulation. For that purpose, we benchmarked each SR execution time for both parallel and serial executions, and computed the speedup obtained in each SR. A typical run using Chord with 10,000 nodes shows that above 500 processes per SR, the speedup is always greater than one. It is interesting to note that even for simulations with different sizes, a similar limit is reached. Analyzing the data thoroughly tells us that 83\% of the SRs with between 250 and 300 processes show a speedup. In consequence, 250 processes will be our base threshold for parallel execution, and the adaptive algorithm proposed in the next section will be in charge of increasing or decreasing that threshold according to the needs and characteristics of the simulation.

In Figure \ref{fig:five} we can see the example with 10,000 simulated nodes. Although it seems that a significant number of SRs with fewer than 250 processes are faster in parallel, they represent only 5\% of that subset of SRs. The remaining 95\% of the SRs with fewer than 500 processes showed a speedup equal to or lower than 1.

#+attr_latex: width=0.8\textwidth,placement=[p]
#+label: fig:five
#+caption: Speedup of parallel vs. sequential executions of SRs, depending on the number of processes executed by each SR
[[file:fig/sr-par-threshold_10000.png]]

** Dynamic estimation of the optimal threshold
Finding an optimal threshold and keeping it during the whole simulation might not always be the best option: the time spent executing user processes varies from one simulation to another. If a simulation has very efficient processes, or processes that do little work, then the threshold could be inappropriate, leading us to parallelize scheduling rounds that would run more efficiently sequentially. That is why a heuristic for dynamic threshold estimation is proposed.

The main idea behind this heuristic (Heuristic \ref{adaptive-algorithm}) is to estimate, during the execution of the simulation, the optimal number of processes above which an SR should be run in parallel. For that purpose, the time of a certain number of scheduling rounds is measured. A performance ratio for the parallel and sequential executions is computed, simply by dividing the time taken by the number of processes computed. If the sequential ratio turns out to be bigger than the parallel one, then the threshold is decreased; otherwise it is increased. A minimal sketch of this naive rule is given below.

A naive implementation of this heuristic showed a small relative improvement in performance. The times were certainly reduced with small sizes, since it chooses to execute the majority of the processes sequentially, while with bigger sizes (more than 100,000 nodes) the speedup is insignificant. In terms of absolute times, the execution times have been slightly reduced (up to ten minutes less in a one-million-node simulation in the best case, with 8 threads). These improvements may be small because we compute the ratio from the times of the latest SRs only, and therefore use values that may not represent the general situation.
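The naive rule described above can be summarized by the following fragment. It is only an illustration of the update rule, not the SimGrid implementation; all variable names are hypothetical.

#+begin_src C
/* Naive adaptive threshold: compare the per-process cost of the latest
 * sequential and parallel scheduling rounds, then nudge the threshold. */
static int    parallel_threshold = 250;  /* initial value, previous subsection */
static double seq_time, par_time;        /* time spent in seq./par. SRs        */
static long   seq_procs, par_procs;      /* processes computed in each mode    */

static void update_threshold(void)
{
  if (seq_procs == 0 || par_procs == 0)
    return;                                 /* not enough data yet */

  double ratio_seq = seq_time / seq_procs;  /* cost per process, sequential */
  double ratio_par = par_time / par_procs;  /* cost per process, parallel   */

  if (ratio_seq > ratio_par)
    parallel_threshold--;   /* parallel pays off: parallelize smaller SRs  */
  else
    parallel_threshold++;   /* sequential is cheaper: be more conservative */
}
#+end_src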
A new approach, using a cumulative ratio (computed over the whole simulation) instead of one computed from the latest values only, proved better in terms of performance. This approach also changes the way we do the timings: instead of benchmarking every SR, we only benchmark the SRs whose number of processes falls within certain limits, an upper limit for the parallel executions and a lower one for the sequential executions. This prevents the timing of extreme cases (very big or very small numbers of processes), which may introduce errors in the estimation of the threshold, and acts like a 'window' that filters the cases we are interested in. These limits are computed along the simulation from the average number of processes that have been run so far in parallel (or sequentially), plus the standard deviation (respectively minus it for the sequential window). Since we never know beforehand how many SRs there will be, the average and the standard deviation are computed online with Welford's algorithm \cite{acsvar,csacsmv}, as sketched below. When a prefixed number of parallel and sequential SRs have been run, we update the threshold by applying a similar rule of thumb: if the sequential executions were better and the current SR has a larger number of processes than the corresponding average, we increase the threshold, giving the serial executions a chance to prove they are better. Otherwise, if the parallel executions performed better and the number of processes of the current SR is smaller than the average, we decrease the threshold.

This new implementation proved to be faster than the original parallel version with sizes under 300,000 nodes, while with larger numbers of nodes the speedup remains almost the same. It also prevents the threshold from growing to unrealistic values, which may happen in the naive version because many SRs with a small number of processes are computed sequentially, and the threshold is increased each time a sequential execution performs better than a parallel one. All the experiments were performed setting the initial threshold to 250 processes, which was estimated as an optimal starting threshold in the previous section. The heuristic leads to different final thresholds depending on the initial one, of course, since the SRs launched in parallel will not be the same from the beginning. However, experiments showed that it behaves quite stably, and within a given simulation the threshold tends to move in the same direction regardless of its initial value.
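The running average and standard deviation used for these windows can be maintained in a single pass with Welford's online algorithm. The following fragment is a minimal sketch; the structure and function names are ours, not SimGrid's.

#+begin_src C
#include <math.h>

/* Welford's online algorithm: update the running mean and the sum of
 * squared deviations (M2) with each new sample, without storing the
 * whole series of scheduling rounds.                                  */
typedef struct {
  long   n;     /* number of samples seen so far */
  double mean;  /* running mean                  */
  double m2;    /* sum of squared deviations     */
} running_stats_t;

static void stats_update(running_stats_t *s, double x)
{
  s->n++;
  double delta = x - s->mean;
  s->mean += delta / s->n;
  s->m2   += delta * (x - s->mean);
}

static double stats_stddev(const running_stats_t *s)
{
  return s->n > 1 ? sqrt(s->m2 / (s->n - 1)) : 0.0;
}

/* Example: upper window for the parallel SRs.          */
/*   par_window = stats.mean + stats_stddev(&stats);    */
#+end_src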
#+begin_latex
\begin{algorithm}
\caption{Adaptive Threshold}\label{adaptive-algorithm}
\begin{algorithmic}
\State \Comment {Amount of parallel/sequential SRs that ran}
\State $parallel\_SRs, sequential\_SRs \gets \textit{1}$
\State \Comment {Sum of times of par/seq SRs}
\State $seq\_time, par\_time \gets \textit{0}$
\State \Comment {Number of processes computed in par/seq}
\State $process\_seq, process\_par \gets \textit{0}$
\State \Comment {Average amount of processes parallel/sequential}
\State avg\_par\_proc, avg\_seq\_proc
\State \Comment {Standard deviation of processes parallel/sequential}
\State sd\_seq\_proc, sd\_par\_proc
\State
\Procedure{RunSchedulingRound}{}
  \If {computed five par/seq SRs}
    \State $ratio\_seq \gets seq\_time/process\_seq$
    \State $ratio\_par \gets par\_time/process\_par$
    \State $sequential\_is\_slower \gets ratio\_seq>ratio\_par$
    \If {$sequential\_is\_slower$}
      \If {$processes\_to\_run < avg\_par\_proc$}
        \State decrease($parallel\_threshold$)
      \EndIf
    \Else
      \If {$processes\_to\_run > avg\_seq\_proc$}
        \State increase($parallel\_threshold$)
      \EndIf
    \EndIf
  \EndIf
  \State
  \If {$processes\_to\_run \geq parallel\_threshold$}
    \If {$processes\_to\_run < par\_window$}
      \State $parallel\_SRs++$
      \State start($timer$)
      \State execute\_SR\_parallel()
      \State stop($timer$)
      \State $par\_time \gets par\_time + $elapsed($timer$)
      \State $process\_par \gets process\_par + processes\_to\_run$
      \State $avg\_par\_proc \gets $calculate\_current\_avg\_of\_par\_processes()
      \State $sd\_par\_proc \gets $calculate\_current\_sd\_of\_par\_processes()
      \State $par\_window \gets avg\_par\_proc + sd\_par\_proc$
    \Else
      \State execute\_SR\_parallel()
    \EndIf
  \Else
    \If {$processes\_to\_run < seq\_window$}
      \State $sequential\_SRs++$
      \State start($timer$)
      \State execute\_SR\_serial()
      \State stop($timer$)
      \State $seq\_time \gets seq\_time + $elapsed($timer$)
      \State $process\_seq \gets process\_seq + processes\_to\_run$
      \State $avg\_seq\_proc \gets $calculate\_current\_avg\_of\_seq\_processes()
      \State $sd\_seq\_proc \gets $calculate\_current\_sd\_of\_seq\_processes()
      \State $seq\_window \gets avg\_seq\_proc - sd\_seq\_proc$
    \Else
      \State execute\_SR\_serial()
    \EndIf
  \EndIf
\EndProcedure
\end{algorithmic}
\end{algorithm}
#+end_latex

#+attr_latex: width=0.8\textwidth,placement=[p]
#+label: fig:six
#+caption: Speedups achieved with the Adaptive threshold heuristic (Chord simulation).
[[file:fig/adapt-algorithm.pdf]]

Regarding the memory consumption, the values remain the same in general, as can be seen in Table \ref{tab:two}.

#+caption: Execution times of the Adaptive threshold heuristic, with 2, 4 and 8 threads. The average memory consumption is reported in GB.
#+name: tab:two
|---+-------+---------+-------+---------+-------+---------+-------|
|   | nodes | 2 thr.  | Mem   | 4 thr.  | Mem   | 8 thr.  | Mem   |
| / | <>    | <       | >     | <       | >     | <       | >     |
|---+-------+---------+-------+---------+-------+---------+-------|
| # | 10k   | 0:01:19 |  0.26 | 0:01:20 |  0.26 | 0:01:27 |  0.25 |
| # | 50k   | 0:07:21 |  1.27 | 0:07:28 |  1.27 | 0:07:30 |  1.26 |
| # | 100k  | 0:15:16 |  2.53 | 0:15:04 |  2.55 | 0:14:48 |  2.51 |
| # | 300k  | 0:54:48 |  7.55 | 0:54:05 |  7.52 | 0:53:44 |  7.48 |
| # | 500k  | 1:38:52 | 12.47 | 1:35:19 | 12.56 | 1:31:50 | 12.45 |
| # | 1m    | 3:59:12 | 24.89 | 3:47:22 | 25.19 | 3:37:12 | 24.91 |
|---+-------+---------+-------+---------+-------+---------+-------|

* Conclusion
#+LaTeX: \label{sec:cc}
We have presented in this work several ways to optimize large-scale distributed simulations within a specific framework: binding threads to physical cores, choosing a better threshold for parallel execution, and choosing between different synchronization modes for the threads. The optimizations were done on the open-source, multi-purpose SimGrid simulation framework, in its development version (3.12). Some of the proposed changes worked better in some scenarios than in others (for instance, binding threads to cores showed a real speedup in simulations using a larger number of threads, such as 16 or 24, while busy waiting proved to be better than futexes in simulations with small sizes and few threads). Also, some of the modifications did not affect the overall performance, or even made it worse, like the parmap changes proposed in Section \ref{sec:parallel}.

Most of the proposed changes gained performance with small simulations (under 300,000 nodes), but performance remained almost the same with larger ones, showing the difficulty of optimizing a complex multi-threaded system. We have certainly arrived at a point where optimization depends heavily on reducing the synchronization costs and playing with low-level features of the code. An intelligent choice of when to launch processes in parallel and when to do it serially proved to help with small cases, but it was unnecessary with bigger ones, where a speedup is already achieved by using threads to run the user processes.

On a final note, the present work was done with the reproducible research approach in mind. Hence, the steps and scripts needed to run the experiments can be found in the appendix.

* Acknowledgments
Experiments presented in this paper were carried out using the Grid'5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several Universities as well as other funding bodies (see https://www.grid5000.fr).

#+LaTeX: \bibliographystyle{abbrv}
#+LaTeX: \bibliography{report}
#+LaTeX: \onecolumn
#+LaTeX: \appendix

* Data Provenance
This section explains and shows how to run the experiments and how the data is saved and then processed. Note that all experiments are run using the Chord simulation that can be found in the \texttt{examples/msg/chord} folder of your SimGrid install. Unless stated otherwise, all the experiments are run using the futex synchronization method and raw contexts under a Linux environment, on a 'parapluie' node of Grid'5000. The analysis of the data can be done within this paper itself, by executing the corresponding R code blocks. Note that it is even possible to execute them remotely if TRAMP is used to open this file (this is useful if you want the data to be processed on a powerful machine, such as a cluster).
** Modifiable Parameters Some of the parameters to run the experiments can be modified, like the amount of nodes to simulate and the amount of threads to use. Note that the list of nodes to simulate have to be changed in both the python session and the shell session. This sessions are intended to last during all your experiments/analysis. This sizes/threads lists are needed to run the simulations, generate platform/deployment files, and generate tables after the experiments. Hence, is mandatory to run this snippets. Bash: #+begin_src sh :session org-sh BASE_DIR=$PWD sizes=(1000 5000 10000 15000 20000 25000 30000 35000 40000 45000 50000 55000 60000 65000 70000 75000 80000 85000 90000 95000 100000 300000 500000 1000000) threads=(1 2 4 8 16 24) #+end_src Python: #+name: set_python_args #+begin_src python :session SIZES = [1000] SIZES += [elem for elem in range(5000,100000,5000)] SIZES += [100000,300000,500000,1000000] THREADS = [1, 2, 4, 8, 16, 24] # All the benchmarks can be done using both modes, but note that this # paper uses only precise MODES = ['precise'] nb_bits = 32 end_date = 10000 #+end_src ** Setting up the machine Install required packages to compile/run SimGrid experiments. If you are in a cluster (such as Grid5000) you can run this file remotely in a deployed node and still be able to setup your environment. Run this two code chunks one after other in order to create folders, install packages and create required deployment/platform files. If the [[setup\_and\_install]] snippet was run before, or everything is already installed and set up, then check/modify the parameters of the shell session with the snippets [[check\_args]] and [[go\_to\_chord]] \texttt{setup\_and\_install}: #+name: setup_and_install #+begin_src sh :session org-sh # Save current directory where the report is BASE_DIR=$PWD apt-get update && apt-get install cmake make gcc git libboost-dev libgct++ libpcre3-dev linux-tools gdb liblua5.1-0-dev libdwarf-dev libunwind7-dev valgrind libsigc++ mkdir -p SimGrid deployment platforms logs fig cd $BASE_DIR/SimGrid/ # Clone latest SimGrid version. You may have to configure proxy settings if you are in a G5K node in order to clone this git repository git clone https://gforge.inria.fr/git/simgrid/simgrid.git . SGPATH='/usr/local' # Save the revision of SimGrid used for the experiment SGHASH=$(git rev-parse --short HEAD) cmake -Denable_compile_optimizations=ON -Denable_supernovae=OFF -Denable_compile_warnings=OFF -Denable_debug=OFF -Denable_gtnets=OFF -Denable_jedule=OFF -Denable_latency_bound_tracking=OFF -Denable_lua=OFF -Denable_model-checking=OFF -Denable_smpi=OFF -Denable_tracing=OFF -Denable_documentation=OFF . make install cd ../../ #+end_src \texttt{generate\_platform\_files}: #+name: generate_platform_files #+begin_src python :session :results output # This function generates a specific platform file for the Chord example. 
import random def platform(nb_nodes, nb_bits, end_date): max_id = 2 ** nb_bits - 1 all_ids = [42] res = ["\n" "\n"] res.append("\n"%(nb_nodes, nb_bits, end_date)) res.append("\n" " \n" % end_date) for i in range(1, nb_nodes): ok = False while not ok: my_id = random.randint(0, max_id) ok = not my_id in all_ids known_id = all_ids[random.randint(0, len(all_ids) - 1)] start_date = i * 10 res.append(" \n" % (i, my_id, known_id, start_date, end_date)) all_ids.append(my_id) res.append("") res = "".join(res) f = open(os.getcwd() + "/platforms/chord%d.xml"%nb_nodes, "w") f.write(res) f.close() return # This function generates a specific deployment file for the Chord example. # It assumes that the platform will be a cluster. def deploy(nb_nodes): res = """ """%(nb_nodes-1) f = open(os.getcwd() + "/deployment/One_cluster_nobb_%d_hosts.xml"%nb_nodes, "w") f.write(res) f.close() return # Remember that SIZES was defined as a global variable in the first python code chunk in [[Modifiable Parameters]] for size in SIZES: platform(size, nb_bits, end_date) deploy(size) #+end_src Optional snippets to check arguments and go to chord folder: \texttt{check\_args}: #+name: check_args #+begin_src sh :session org-sh echo $sizes echo $threads echo $BASE_DIR #sizes=(1000) #threads=(1 2) #BASE_DIR=$PWD echo $sizes echo $threads echo $BASE_DIR #+end_src \texttt{go\_to\_chord}: #+name: go_to_chord #+begin_src sh :session org-sh cd $BASE_DIR/SimGrid/examples/msg/chord echo $BASE_DIR echo $sizes echo $threads make #+end_src ** Scripts to run benchmarks This are general scripts that can be used to run all the benchmarks after the proper modifications were done. \texttt{testall}: #+name: testall #+begin_src sh :var SG_PATH='/usr/local' :var log_folder="logs" :session org-sh # This script is to benchmark the Chord simulation that can be found # in examples/msg/chord folder. # The benchmark can be done with both Constant and Precise mode, using # different sizes and number of threads (which can be modified). # This script also generate a table with all the times gathered, that can ease # the plotting, compatible with gnuplot/R. # By now, this script copy all data (logs generated an final table) to a # personal frontend-node in Grid5000. This should be modified in the near # future. ############################################################################### # MODIFIABLE PARAMETERS: SGPATH, SGHASH, sizes, threads, log_folder, file_table # host_info, timefmt, cp_cmd, dest. # Path to installation folder needed to recompile chord # If it is not set, assume that the path is '/usr/local' if [ -z "$SG_PATH" ] then SGPATH='/usr/local' fi # Save the revision of SimGrid used for the experiment SGHASH=$(git rev-parse --short HEAD) # List of sizes to test. Modify this to add different sizes. if [ -z "$sizes" ] then sizes=(1000 3000) fi # Number of threads to test. if [ -z "$threads"] then threads=(1 2 4 8 16 24) fi # Path where to store logs, and filenames of times table, host info if [ -z "$log_folder"] then log_folder=$BASE_DIR"/logs" else log_folder=$BASE_DIR"/logs/"$log_folder fi if [ ! -d "$log_folder" ] then echo "Creating $log_folder to store logs." mkdir -p $log_folder fi # Copy all the generated deployment/platform files into chord folder cp $BASE_DIR/platforms/* . cp $BASE_DIR/deployment/* . file_table="timings_$SGHASH.csv" host_info="host_info.org" rm -rf $host_info # The las %U is just to ease the parsing for table timefmt="clock:%e user:%U sys:%S telapsed:%e swapped:%W exitval:%x max:%Mk avg:%Kk %U" # Copy command. 
This way one can use cp, scp and a local folder or a folder in # a cluster. sep=',' cp_cmd='cp' dest=$log_folder"/." # change for @.grid5000.fr:~/$log_folder if necessary ############################################################################### ############################################################################### echo "Recompile the binary against $SGPATH" export LD_LIBRARY_PATH="$SGPATH/lib" rm -rf chord gcc chord.c -L$SGPATH/lib -I$SGPATH/include -I$SGPATH/src/include -lsimgrid -o chord if [ ! -e "chord" ]; then echo "chord does not exist" exit; fi ############################################################################### ############################################################################### # PRINT HOST INFORMATION IN DIFFERENT FILE set +e echo "#+TITLE: Chord experiment on $(eval hostname)" >> $host_info echo "#+DATE: $(eval date)" >> $host_info echo "#+AUTHOR: $(eval whoami)" >> $host_info echo " " >> $host_info echo "* People logged when experiment started:" >> $host_info who >> $host_info echo "* Hostname" >> $host_info hostname >> $host_info echo "* System information" >> $host_info uname -a >> $host_info echo "* CPU info" >> $host_info cat /proc/cpuinfo >> $host_info echo "* CPU governor" >> $host_info if [ -f /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor ]; then cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor >> $host_info else echo "Unknown (information not available)" >> $host_info fi echo "* CPU frequency" >> $host_info if [ -f /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq ]; then cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq >> $host_info else echo "Unknown (information not available)" >> $host_info fi echo "* Meminfo" >> $host_info cat /proc/meminfo >> $host_info echo "* Memory hierarchy" >> $host_info lstopo --of console >> $host_info echo "* Environment Variables" >> $host_info printenv >> $host_info echo "* Tools" >> $host_info echo "** Linux and gcc versions" >> $host_info cat /proc/version >> $host_info echo "** Gcc info" >> $host_info gcc -v 2>> $host_info echo "** Make tool" >> $host_info make -v >> $host_info echo "** CMake" >> $host_info cmake --version >> $host_info echo "* SimGrid Version" >> $host_info grep "SIMGRID_VERSION_STRING" ../../../include/simgrid_config.h | sed 's/.*"\(.*\)"[^"]*$/\1/' >> $host_info echo "* SimGrid commit hash" >> $host_info git rev-parse --short HEAD >> $host_info $($cp_cmd $host_info $dest) ############################################################################### ############################################################################### # ECHO TABLE HEADERS INTO FILE_TABLE rm -rf $file_table tabs_needed="" for thread in "${threads[@]}"; do thread_line=$thread_line"\t"$thread done thread_line=$thread_line$thread_line for size in $(seq 1 $((${#threads[@]}-1))); do tabs_needed=$tabs_needed"\t" done echo "#SimGrid commit $SGHASH" >> $file_table echo -e "#\t\tconstant${tabs_needed}precise" >> $file_table echo -e "#size/thread$thread_line" >> $file_table ############################################################################### ############################################################################### # START SIMULATION test -e tmp || mkdir tmp me=tmp/`hostname -s` for size in "${sizes[@]}"; do line_table=$size # CONSTANT MODE for thread in "${threads[@]}"; do filename="chord_${size}_threads${thread}_constant.log" rm -rf $filename if [ ! -f chord$size.xml ]; then ./generate.py -p -n $size -b 32 -e 10000 fi if [ ! 
-f One_cluster_nobb_${size}_hosts.xml ]; then ./generate.py -d -n $size fi echo "$size nodes, constant model, $thread threads" cmd="./chord One_cluster_nobb_"$size"_hosts.xml chord$size.xml --cfg=contexts/stack_size:16 --cfg=network/model:Constant --cfg=network/latency_factor:0.1 --log=root.thres:info --cfg=contexts/nthreads:$thread --cfg=contexts/guard_size:0" /usr/bin/time -f "$timefmt" -o $me.timings $cmd $cmd 1>/tmp/stdout-xp 2>/tmp/stderr-xp if grep "Command terminated by signal" $me.timings ; then echo "Error detected:" temp_time="errSig" elif grep "Command exited with non-zero status" $me.timings ; then echo "Error detected:" temp_time="errNonZero" else temp_time=$(cat $me.timings | awk '{print $(NF)}') fi # param cat $host_info >> $filename echo "* Experiment settings" >> $filename echo "size:$size, constant network, $thread threads" >> $filename echo "cmd:$cmd" >> $filename #stderr echo "* Stderr output" >> $filename cat /tmp/stderr-xp >> $filename # time echo "* Timings" >> $filename cat $me.timings >> $filename line_table=$line_table$sep$temp_time $($cp_cmd $filename $dest) rm -rf $filename rm -rf $me.timings done #PRECISE MODE for thread in "${threads[@]}"; do echo "$size nodes, precise model, $thread threads" filename="chord_${size}_threads${thread}_precise.log" cmd="./chord One_cluster_nobb_"$size"_hosts.xml chord$size.xml --cfg=contexts/stack_size:16 --cfg=maxmin/precision:0.00001 --log=root.thres:info --cfg=contexts/nthreads:$thread --cfg=contexts/guard_size:0" /usr/bin/time -f "$timefmt" -o $me.timings $cmd $cmd 1>/tmp/stdout-xp 2>/tmp/stderr-xp if grep "Command terminated by signal" $me.timings ; then echo "Error detected:" temp_time="errSig" elif grep "Command exited with non-zero status" $me.timings ; then echo "Error detected:" temp_time="errNonZero" else temp_time=$(cat $me.timings | awk '{print $(NF)}') fi # param cat $host_info >> $filename echo "* Experiment settings" >> $filename echo "size:$size, constant network, $thread threads" >> $filename echo "cmd:$cmd" >> $filename #stderr echo "* Stderr output" >> $filename cat /tmp/stderr-xp >> $filename # time echo "* Timings" >> $filename cat $me.timings >> $filename line_table=$line_table$sep$temp_time $($cp_cmd $filename $dest) rm -rf $filename rm -rf $me.timings done echo -e $line_table >> $file_table done $($cp_cmd $file_table $dest) rm -rf $file_table rm -rf tmp #+end_src \texttt{testall\_sr}: #+name: testall_sr #+begin_src sh :var SG_PATH='/usr/local' :var log_folder="logs" :session org-sh # This script is to benchmark the Chord simulation that can be found # in examples/msg/chord folder. # The benchmark is done with both Constant and Precise mode, using # different sizes and number of threads (which can be modified). # This script also generate a table with all the times gathered, that can ease # the plotting, compatible with gnuplot/R. # By now, this script copy all data (logs generated an final table) to a # personal frontend-node in Grid5000. This should be modified in the near # future. ############################################################################### # MODIFIABLE PARAMETERS: SGPATH, SGHASH, sizes, threads, log_folder, file_table # host_info, timefmt, cp_cmd, dest. # Path to installation folder needed to recompile chord # If it is not set, assume that the path is '/usr/local' if [ -z "$SG_PATH" ] then SGPATH='/usr/local' fi # Save the revision of SimGrid used for the experiment SGHASH=$(git rev-parse --short HEAD) # List of sizes to test. Modify this to add different sizes. 
if [ -z "$sizes" ] then sizes=(1000 3000) fi # Number of threads to test. if [ -z "$threads"] then threads=(1 2 4 8 16 24) fi # Path where to store logs, and filenames of times table, host info if [ -z "$log_folder"] then log_folder=$BASE_DIR"/logs" else log_folder=$BASE_DIR"/logs/"$log_folder fi if [ ! -d "$log_folder" ] then echo "Creating $log_folder to store logs." mkdir -p $log_folder fi # Copy all the generated deployment/platform files into chord folder cp $BASE_DIR/platforms/* . cp $BASE_DIR/deployment/* . file_table="timings_$SGHASH.csv" host_info="host_info.org" rm -rf $host_info # The las %U is just to ease the parsing for table timefmt="clock:%e user:%U sys:%S telapsed:%e swapped:%W exitval:%x max:%Mk avg:%Kk %U" # Copy command. This way one can use cp, scp and a local folder or a folder in # a cluster. sep=',' cp_cmd='cp' dest=$log_folder # change for @.grid5000.fr:~/$log_folder if necessary ############################################################################### ############################################################################### echo "Recompile the binary against $SGPATH" export LD_LIBRARY_PATH="$SGPATH/lib" rm -rf chord gcc chord.c -L$SGPATH/lib -I$SGPATH/include -I$SGPATH/src/include -lsimgrid -o chord if [ ! -e "chord" ]; then echo "chord does not exist" exit; fi ############################################################################### ############################################################################### # PRINT HOST INFORMATION IN DIFFERENT FILE set +e echo "#+TITLE: Chord experiment on $(eval hostname)" >> $host_info echo "#+DATE: $(eval date)" >> $host_info echo "#+AUTHOR: $(eval whoami)" >> $host_info echo " " >> $host_info echo "* People logged when experiment started:" >> $host_info who >> $host_info echo "* Hostname" >> $host_info hostname >> $host_info echo "* System information" >> $host_info uname -a >> $host_info echo "* CPU info" >> $host_info cat /proc/cpuinfo >> $host_info echo "* CPU governor" >> $host_info if [ -f /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor ]; then cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor >> $host_info else echo "Unknown (information not available)" >> $host_info fi echo "* CPU frequency" >> $host_info if [ -f /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq ]; then cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq >> $host_info else echo "Unknown (information not available)" >> $host_info fi echo "* Meminfo" >> $host_info cat /proc/meminfo >> $host_info echo "* Memory hierarchy" >> $host_info lstopo --of console >> $host_info echo "* Environment Variables" >> $host_info printenv >> $host_info echo "* Tools" >> $host_info echo "** Linux and gcc versions" >> $host_info cat /proc/version >> $host_info echo "** Gcc info" >> $host_info gcc -v 2>> $host_info echo "** Make tool" >> $host_info make -v >> $host_info echo "** CMake" >> $host_info cmake --version >> $host_info echo "* SimGrid Version" >> $host_info grep "SIMGRID_VERSION_STRING" ../../../include/simgrid_config.h | sed 's/.*"\(.*\)"[^"]*$/\1/' >> $host_info echo "* SimGrid commit hash" >> $host_info git rev-parse --short HEAD >> $host_info $($cp_cmd $host_info $dest) ############################################################################### ############################################################################### # ECHO TABLE HEADERS INTO FILE_TABLE rm -rf $file_table tabs_needed="" for thread in "${threads[@]}"; do thread_line=$thread_line"\t"$thread done thread_line=$thread_line$thread_line for size 
in $(seq 1 $((${#threads[@]}-1))); do tabs_needed=$tabs_needed"\t" done echo "#SimGrid commit $SGHASH" >> $file_table echo -e "#\t\tconstant${tabs_needed}precise" >> $file_table echo -e "#size/thread$thread_line" >> $file_table ############################################################################### ############################################################################### # START SIMULATION test -e tmp || mkdir tmp me=tmp/`hostname -s` for size in "${sizes[@]}"; do line_table=$size # CONSTANT MODE for thread in "${threads[@]}"; do filename="chord_${size}_threads${thread}_constant.log" output="sr_${size}_threads${thread}_constant.log" rm -rf $filename if [ ! -f chord$size.xml ]; then ./generate.py -p -n $size -b 32 -e 10000 fi if [ ! -f One_cluster_nobb_${size}_hosts.xml ]; then ./generate.py -d -n $size fi echo "$size nodes, constant model, $thread threads" cmd="./chord One_cluster_nobb_"$size"_hosts.xml chord$size.xml --cfg=contexts/stack_size:16 --cfg=network/model:Constant --cfg=network/latency_factor:0.1 --log=root.thres:critical --cfg=contexts/nthreads:$thread --cfg=contexts/guard_size:0" /usr/bin/time -f "$timefmt" -o $me.timings $cmd $cmd 1>/tmp/stdout-xp 2>/tmp/stderr-xp if grep "Command terminated by signal" $me.timings ; then echo "Error detected:" temp_time="errSig" elif grep "Command exited with non-zero status" $me.timings ; then echo "Error detected:" temp_time="errNonZero" else temp_time=$(cat $me.timings | awk '{print $(NF)}') fi # param cat $host_info >> $filename echo "* Experiment settings" >> $filename echo "size:$size, constant network, $thread threads" >> $filename echo "cmd:$cmd" >> $filename #stdout echo "* Stdout output" >> $filename cat /tmp/stdout-xp | grep Amdahl >> $filename #stderr echo "* Stderr output" >> $filename cat /tmp/stderr-xp >> $filename # time echo "* Timings" >> $filename cat $me.timings >> $filename line_table=$line_table$sep$temp_time # Gather SR data from logs echo -e '#id_sr\ttime_taken\tamount_proccesses' >> $output grep 'Total time SR' $filename | awk '{print $7 "\x09" $9 "\x09" $10}' | tr -d ',' >> $output $($cp_cmd $output $dest) $($cp_cmd $filename $dest) rm -rf $filename $output rm -rf $me.timings done #PRECISE MODE for thread in "${threads[@]}"; do echo "$size nodes, precise model, $thread threads" filename="chord_${size}_threads${thread}_precise.log" output="sr_${size}_threads${thread}_precise.log" cmd="./chord One_cluster_nobb_"$size"_hosts.xml chord$size.xml --cfg=contexts/stack_size:16 --cfg=maxmin/precision:0.00001 --log=root.thres:critical --cfg=contexts/nthreads:$thread --cfg=contexts/guard_size:0" /usr/bin/time -f "$timefmt" -o $me.timings $cmd $cmd 1>/tmp/stdout-xp 2>/tmp/stderr-xp if grep "Command terminated by signal" $me.timings ; then echo "Error detected:" temp_time="errSig" elif grep "Command exited with non-zero status" $me.timings ; then echo "Error detected:" temp_time="errNonZero" else temp_time=$(cat $me.timings | awk '{print $(NF)}') fi # param cat $host_info >> $filename echo "* Experiment settings" >> $filename echo "size:$size, constant network, $thread threads" >> $filename echo "cmd:$cmd" >> $filename #stderr echo "* Stderr output" >> $filename cat /tmp/stderr-xp >> $filename # time echo "* Timings" >> $filename cat $me.timings >> $filename line_table=$line_table$sep$temp_time # Gather SR data from logs echo -e '#id_sr\ttime_taken\tamount_proccesses' >> $output grep 'Total time SR' $filename | awk '{print $7 "\x09" $9 "\x09" $10}' | tr -d ',' >> $output $($cp_cmd $output $dest) $($cp_cmd $filename 
$dest) rm -rf $filename $output rm -rf $me.timings done echo -e $line_table >> $file_table done $($cp_cmd $file_table $dest) rm -rf $file_table rm -rf tmp #+end_src ** Baseline Performance The benchmark can be run from this org-mode file, or simply by running \texttt{./scripts/chord/testall.sh} inside the folder \texttt{examples/msg/chord} of your SimGrid installation. Inside that script, the number of threads to test, as well as the amount of nodes, can be modified The script generates a .csv table, but just in case it is done in different stages, the resulting logs can be processed with \texttt{./scripts/chord/get\_times.py} (located in the same folder as testall.sh). This generates a .csv that can easily be plotted with R/gnuplot. The script is self-documented. #+call: check_args[:session org-sh]() #+call: go_to_chord[:session org-sh]() #+call: testall[:session org-sh](log_folder='timings/logs') ** SR Distribution To enable Scheduling Rounds benchmarks, the constant \texttt{TIME\_BENCH\_ENTIRE\_SRS} has to be defined. It can be defined in \texttt{src/simix/smx\_private.h} The logs give information about the time it takes to run a scheduling round, as well as the amount of processes each SR takes. For this experiment, we are only interested in the amount of processes taken by each SR. The script to run this experiment is \texttt{./scripts/chord/testall\_sr.sh}. It gathers data about the id of each SR, time of each SR and num processes of SR, in stores it in table format. #+call: check_args[:session org-sh]() #+call: go_to_chord[:session org-sh]() #+call: testall_sr[:session org-sh](log_folder='sr_counts/logs') ** SR Times The data set used for this plot is the same as the one before. We just use the data of the sequential simulations (1 thread). ** Binding threads to physical cores The constant \texttt{CORE\_BINDING} has to be defined in \texttt{include/xbt/xbt\_os\_thread.h} in order to enable this optimization. The benchmark is then run in the same way as the Amdahl Speedup experiment. #+call: check_args[:session org-sh]() #+call: go_to_chord[:session org-sh]() #+call: testall[:session org-sh](log_folder='binding_cores/logs') ** parmap between N cores This may be the experiment that requires more work to reproduce: *** maestro works with N-1 threads This is the default setting and the standard benchmark can be used. #+call: check_args[:session org-sh]() #+call: go_to_chord[:session org-sh]() #+call: testall[:session org-sh](log_folder='pmapM_N-1/logs') *** maestro sleeps with N-1 threads To avoid that maestro works with the threads, comment out the line: \texttt{xbt\_parmap\_work(parmap);} from the function \texttt{xbt\_parmap\_apply()} in \texttt{src/xbt/parmap.c} #+call: check_args[:session org-sh]() #+call: go_to_chord[:session org-sh]() #+call: testall[:session org-sh](log_folder='pmap_N-1/logs') *** maestro sleeps with N threads To avoid that maestro works with the threads, comment out the line: \texttt{xbt\_parmap\_work(parmap);} from the function \texttt{xbt\_parmap\_apply()} in \texttt{src/xbt/parmap.c} Then the function \texttt{src/xbt/parmap.c:xbt\_parmap\_new} has to be modified to create one extra thread. It is easy: just add 1 to \texttt{num\_workers} parameter. #+call: check_args[:session org-sh]() #+call: go_to_chord[:session org-sh]() #+call: testall[:session org-sh](log_folder='pmap_N/logs') ** Busy Waiters vs. 
Futexes performance Enable the use of busy waiters running chord with the extra option: \texttt{--cfg=contexts/synchro:busy\_wait} The experiment was run with testall.sh using that extra option in the chord command inside the script. The tables were constructed using \texttt{get\_times.py}. The data regarding the futexes times is the same gathered in Baseline Performance experiment. #+call: check_args[:session org-sh]() #+call: go_to_chord[:session org-sh]() #+call: testall[:session org-sh](log_folder='busy_waiters/logs') ** SR parallel threshold The data set is the same as SR Distribution and SR times experiments. ** Adaptive threshold The benchmark is done using testall.sh. The algorithm is the one described in section 5.2, and it can be enabled by defining the constant \texttt{ADAPTIVE\_ALGORITHM} in \texttt{src/simix/smx\_private.h} #+call: check_args[:session org-sh]() #+call: go_to_chord[:session org-sh]() #+call: testall[:session org-sh](log_folder='adaptive-algorithm/logs') * Data Analysis :noexport: ** Installing required packages #+begin_src R :exports none install.packages("ggplot2") install.packages("gridExtra") install.packages("reshape") install.packages("plyr") install.packages("data.table") install.packages("stringr") install.packages("grid") #+end_src ** Libraries/Auxiliary functions #+begin_src R :exports none # If you miss the libraries, try typing >>>install.packages("data.table")<<< in a R console library('ggplot2') library('gridExtra') library('reshape') library('plyr') library('data.table') library('stringr') require('grid') # To plot several ggplot in one window. vp.layout <- function(x, y) viewport(layout.pos.row=x, layout.pos.col=y) arrange_ggplot2 <- function(..., nrow=NULL, ncol=NULL, as.table=FALSE) { dots <- list(...) n <- length(dots) if(is.null(nrow) & is.null(ncol)){ nrow = floor(n/2) ; ncol = ceiling(n/nrow) } if(is.null(nrow)){ nrow = ceiling(n/ncol) } if(is.null(ncol)){ ncol = ceiling(n/nrow) } grid.newpage() pushViewport(viewport(layout=grid.layout(nrow,ncol))) ii.p <- 1 for(ii.row in seq(1, nrow)){ ii.table.row <- ii.row if(as.table) { ii.table.row <- nrow - ii.table.row + 1 } for(ii.col in seq(1, ncol)){ ii.table <- ii.p if(ii.p > n) break print(dots[[ii.table]], vp=vp.layout(ii.table.row, ii.col)) ii.p <- ii.p + 1 } } } # Get legend from a given plot g_legend<-function(a.gplot){ tmp <- ggplot_gtable(ggplot_build(a.gplot)) leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box") legend <- tmp$grobs[[leg]] return(legend) } #+end_src #+RESULTS: ** Pre-processing of datasets The .csv files needed for almost all plots are created here, as well as some R data sets that speed things up a little bit. 
#+name: process_data_sr-times #+begin_src R temp = list.files(path='./logs/sr_counts/logs', pattern="sr_20000_threads1_precise.log", full.names = TRUE) flist <- lapply(temp, read.table) sr_data <- rbindlist(flist) sr_data[, "V1"] <- NULL sr_data = as.data.frame.matrix(sr_data) saveRDS(sr_data, file="./logs/sr_counts/sr-times.Rda") #+end_src #+name: process_data_sr-par-threshold #+begin_src R #PRECISE MODE #SEQUENTIAL temp = list.files(path='./logs/sr_counts/logs', pattern="threads1_", full.names = TRUE) temp <- temp[grepl("precise", temp)] temp <- temp[grepl("25000", temp)] #temp <- temp[-grep("50000", temp)] #temp <- temp[-grep("75000", temp)] flist <- lapply(temp, read.table) sr_data <- rbindlist(flist) #sr_data[, "V1"] <- NULL sr_data = as.data.frame.matrix(sr_data) #df <- ddply(sr_data, .(V3), summarize, mean_value = mean(V2)) #PARALLEL: temp2 = list.files(path='./logs/sr_counts/logs', pattern="threads4_", full.names = TRUE) temp2 <- temp2[grepl("precise", temp2)] temp2 <- temp2[grepl("25000", temp2)] flist2 <- lapply(temp2, read.table) sr_data2 <- rbindlist(flist2) #sr_data2[, "V1"] <- NULL sr_data2 = as.data.frame.matrix(sr_data2) #df2 <- ddply(sr_data2, .(V3), summarize, mean_value = mean(V2)) #CONSTANT MODE #SEQUENTIAL #temp3 = list.files(path='./logs/sr_counts/sequential', pattern="threads4_", full.names = TRUE) #temp3 <- temp3[grepl("constant", temp3)] #flist <- lapply(temp3, read.table) #sr_data3 <- rbindlist(flist) #sr_data3[, "V1"] <- NULL #sr_data3 = as.data.frame.matrix(sr_data3) #df3 <- ddply(sr_data3, .(V3), summarize, mean_value = mean(V2)) #PARALLEL: #temp4 = list.files(path='./logs/sr_counts/parallel', pattern="threads4_", full.names = TRUE) #temp4 <- temp4[grepl("constant", temp4)] #temp4 <- temp4[-grep("50000", temp4)] #temp4 <- temp4[-grep("75000", temp4)] #flist2 <- lapply(temp4, read.table) #sr_data4 <- rbindlist(flist2) #sr_data4[, "V1"] <- NULL #sr_data4 = as.data.frame.matrix(sr_data4) #df4 <- ddply(sr_data4, .(V3), summarize, mean_value = mean(V2)) #Merge PRECISE datasets df5 = merge(sr_data, sr_data2, by = 'V1', incomparables = NULL) df5 <- transform(df5, speedup = V2.x / V2.y) saveRDS(df5, file="./logs/sr_counts/precise.Rda") #Merge CONSTANT datasets #df6 = merge(sr_data3, sr_data4, by = 'V1', incomparables = NULL) #df6 <- transform(df6, speedup = V2.x / V2.y) #df6[, 'speedup'] <- df6[,'mean_value.x'] / df6[, 'mean_value.y'] #saveRDS(df6,file="./logs/sr_counts/constant.Rda") #+end_src #+name: see_percentages_sr-par-threshold #+begin_src: R precise <- readRDS(file="./logs/sr_counts/logs/precise_10000.Rda") under_500 <- precise[precise$V3.x<250,] under_500 <- under_500[complete.cases(under_500),] under_500 <- under_500[is.finite(rowSums(under_500)), ] num_under_500 <- nrow(under_500) # to calculate percentage of SR's with less than 500 processes that had speedup. a <- under_500[under_500$speedup > 1,] n_speedup <- nrow(a) b <- under_500[under_500$speedup <= 1,] n_no_speedup <- nrow(b) # Percentage of SR with less than 500 processes that had/hadnt speedup perc_speedup <- (n_speedup * 100) / num_under_500 perc_no_speedup <- (n_no_speedup * 100) / num_under_500 #+end_src # OPTIONAL: Maybe you want to call this function to be sure that the THREADS and SIZES are the ones you want to plot. 
# OPTIONAL: call set_python_args below first to be sure that THREADS and SIZES are the ones you want to plot.
#+call: set_python_args() :session

#+name: create_table
#+begin_src python :session :var elapsed=0 :var amdahl=0 :var memory=0 :var logs_path='"logs"' :var output_file='"logs/total_times.csv"' :results output
# This is a set of functions that can generate nice .csv files with
# the times of the experiments. The memory consumption can also be
# gathered. Note that the logs are the ones generated by the [[testall]]
# code chunk.
# Parameters:
#   elapsed: if set to True, the elapsed (wallclock) time is gathered.
#   amdahl: if set to True, the times of the Amdahl benchmark are gathered.
#   memory: if set to True, the peak RAM used by the process is gathered.
#   If none of them is set, usrtime + systime is gathered.
#   logs_path: where the logs to analyze are stored.
#   output_file: where to store the produced table.
# If you make several tests of the same experiment, you can name the log files
# with a prefix ('1_chord..., 2_chord...') and then put the prefixes you used
# in input_seq. The script will average the corresponding values for you.
import datetime  # used to pretty-print averaged elapsed times as H:MM:SS

input_seq = ['']

def parse_elapsed_and_memory_used(file):
    line = file.read().splitlines()
    l = line[-1]
    if l:
        t = float((l.split()[0]).split(':')[1])
        mem = float(((l.split()[6]).split(':')[1]).replace('k', ''))
        mem = mem / (1024.0 * 1024.0)  # gigabytes used
        mem = float(("{0:.2f}".format(mem)))
        return (t, mem)
    else:
        return (0, 0)

def parse_memory_used(file):
    line = file.read().splitlines()
    l = line[-1]
    if l:
        mem = float(((l.split()[6]).split(':')[1]).replace('k', ''))
        mem = mem / (1024.0 * 1024.0)  # gigabytes used
        mem = float(("{0:.2f}".format(mem)))
        return mem
    else:
        return 0

def parse_elapsed_real(file):
    line = file.read().splitlines()[-1]
    if line:
        return float((line.split()[0]).split(':')[1])
    else:
        return 0

def parse_user_kernel(file):
    line = file.read().splitlines()[-1]
    if line:
        usrtime = float((line.split(":")[2]).split()[0])
        systime = float((line.split(":")[3]).split()[0])
        return usrtime + systime
    else:
        return 0

def parse_amdahl_times(file):
    line = [line for line in file.read().splitlines() if "Amdahl" in line]
    line = [(((l.split(";")[0]).split(":")[-1]).strip(), ((l.split(";")[1]).split(":")[1]).strip()) for l in line][0]
    return float(line[0]) + float(line[1])

def print_header(file):
    file.write('"nodes"')
    for mode in MODES:
        for thread in THREADS:
            file.write(',"'+mode[0]+str(thread)+'"')
    file.write('\n')

def parse_files(elapsed, amdahl, mem, logs_path, output_file):
    f = open(output_file, "w")
    print_header(f)
    for size in SIZES:
        temp_line = "{}".format(size)
        for mode in MODES:
            for thread in THREADS:
                sum_l = 0.
                mem_used = 0.
                leng = len(input_seq)
                for seq in input_seq:
                    file = open("{}/chord{}_{}_threads{}_{}.log".format(logs_path, seq, size, thread, mode), "r")
                    if mem and elapsed:
                        tup = parse_elapsed_and_memory_used(file)
                        sum_l += tup[0]
                        mem_used += tup[1]
                    elif elapsed:
                        sum_l += parse_elapsed_real(file)
                    elif amdahl:
                        sum_l += parse_amdahl_times(file)
                    elif mem:
                        sum_l += parse_memory_used(file)
                    else:
                        sum_l += parse_user_kernel(file)
                if leng != 0:
                    if mem and elapsed:
                        temp_line += ",{0},{1:.2f}".format(datetime.timedelta(seconds=int(sum_l / float(leng))), (mem_used / float(leng)))
                    else:
                        temp_line += ",{}".format(sum_l / float(leng))
                else:
                    if mem and elapsed:
                        temp_line += ",?,?"
                    else:
                        temp_line += ",?"
        # One line per simulated size; the columns follow the MODES x THREADS
        # order written by print_header().
        f.write(temp_line + "\n")
    f.close()

parse_files(elapsed, amdahl, memory, logs_path, output_file)
#+end_src

#+call: create_table(0,1,0,'"logs/amdahl/logs"','"logs/amdahl/total_times_amdahl.csv"') :session
#+call: create_table(1,0,0,'"logs/timings/logs"','"logs/timings/total_times.csv"') :session
#+call: create_table(0,0,1,'"logs/timings/logs"','"logs/timings/memory_consumption.csv"') :session

# Call this to change the number of threads: for the next two tables we don't include the serial benchmarks.
#+begin_src python :session
# We only test performance improvements of the adaptive algorithm and of the
# busy waiters in parallel executions.
THREADS = [2, 4, 8, 16, 24]
#+end_src

#+call: create_table(1,0,0,'"logs/busy_waiters/logs"','"logs/busy_waiters/total_times_busy.csv"') :session
#+call: create_table(1,0,0,'"logs/adaptive_algorithm/logs"','"logs/adaptive_algorithm/total_times_adaptive.csv"') :session

# OPTIONAL: this .csv is useful for the table of Section 5.2
#+call: create_table(1,0,1,'"logs/timings/logs"','"logs/timings/total_times_memory_adaptive.csv"') :session

** Plotting
#+name: baseline_perf
#+begin_src R :results output graphics :exports results :file fig/baseline-perf.pdf
data = read.csv("./logs/timings/total_times.csv", head=TRUE, sep=',')
# Speedups of Precise Mode
data[, "baseline"] <- data[, "p1"] / data[, "p1"]
data[, "2"] <- data[, "p1"] / data[, "p2"]
data[, "4"] <- data[, "p1"] / data[, "p4"]
data[, "8"] <- data[, "p1"] / data[, "p8"]
data[, "16"] <- data[, "p1"] / data[, "p16"]
data[, "24"] <- data[, "p1"] / data[, "p24"]
keep <- c("nodes", colnames(data)[grep("^[1-9]", colnames(data))], "baseline")
speedup_precise <- data[keep]
df2 <- melt(speedup_precise, id = 'nodes', variable_name = 'threads')
g2 <- ggplot(df2, aes(x=nodes, y=value, group=threads, colour=threads)) + geom_line() +
  theme(axis.text.x = element_text(angle = -45, hjust = 0),
        panel.grid.major=element_line(colour='grey'), panel.grid.minor=element_blank(),
        panel.background = element_blank(), axis.line=element_line(), legend.position="right") +
  xlab("Amount of nodes simulated") + ylab("Speedup-Precise mode") +
  scale_fill_discrete(name="threads") +
  scale_x_continuous(breaks=df2$nodes)
g2
#+end_src

#+name: sr-distribution
#+begin_src R :results output graphics :exports results :file fig/sr-distribution.pdf
temp = list.files(path='./logs/sr_counts/logs', pattern="threads4", full.names = TRUE)
temp <- temp[grep("precise", temp)]
# This data.frame will store the final proportion values.
#proportions <- data.frame(stringsAsFactors=FALSE)
proportions <- data.frame(row.names = c('1','2','3-5','6-10','11-20','21-30','31+'))
head <- c()
for(i in temp){
  col <- c()
  # Parse the amount of nodes from the file path.
  # Example of file path: './logs/sr_counts/parallel/sr_10000_threads4_constant.log'
  nodes = strsplit(str_extract(i, "_[0-9]+_"), "_")[[1]][2]
  head <- c(head, as.numeric(nodes))
  col <- c(col, nodes)
  # Keep only the column with the amount of processes
  data <- read.table(i)["V3"]
  # Calculate proportions
  data <- prop.table(xtabs(~ V3, data=data))
  # Populate a new data frame with the percentages of interest
  # (1, 2, 3-5, 6-10, 11-20, 21-30 and 31+ processes)
  proc1 <- data["1"][[1]]
  proc2 <- data["2"][[1]]
  proc3_5 <- c(data["3"][[1]], data["4"][[1]], data["5"][[1]])
  proc6_10 <- c(data["6"][[1]], data["7"][[1]], data["8"][[1]], data["9"][[1]], data["10"][[1]])
  proc11_20 <- c(data["11"][[1]], data["12"][[1]], data["13"][[1]], data["14"][[1]], data["15"][[1]],
                 data["16"][[1]], data["17"][[1]], data["18"][[1]], data["19"][[1]], data["20"][[1]])
  proc21_30 <- c(data["21"][[1]], data["22"][[1]], data["23"][[1]], data["24"][[1]], data["25"][[1]],
                 data["26"][[1]], data["27"][[1]], data["28"][[1]], data["29"][[1]], data["30"][[1]])
  # Calculate final percentages and omit any possible NA
  proc3_5 <- Reduce("+", proc3_5[!is.na(proc3_5)])
  proc6_10 <- Reduce("+", proc6_10[!is.na(proc6_10)])
  proc11_20 <- Reduce("+", proc11_20[!is.na(proc11_20)])
  proc21_30 <- Reduce("+", proc21_30[!is.na(proc21_30)])
  proc31 <- 1 - (proc1 + proc2 + proc3_5 + proc6_10 + proc11_20 + proc21_30)
  #p <- c(nodes, proc1, proc2, proc3_5, proc6_10, proc11_20, proc21_30, proc31)
  # And bind to the existing data.frame
  #p <- as.data.frame(p)
  #p[,'nodes'] <- nodes
  #p[,'process'] <- c("1","2",">3")
  proportions <- cbind(proportions, nodes = c(proc1, proc2, proc3_5, proc6_10, proc11_20, proc21_30, proc31))
  colnames(proportions)[length(proportions)] <- as.numeric(nodes)
}
head <- sort(head)
cols <- c()
for(e in head){ cols <- c(cols, toString(e)) }
proportions <- proportions[, cols]
b <- barplot(as.matrix(proportions),
             ylab="Proportion of SR's having different number of processes",
             legend=rownames(proportions),
             args.legend = list(x = ncol(proportions) + 5.5, bty = "n"),
             xlim=c(0, ncol(proportions) + 4), las=2, cex.axis = 0.8)
title(xlab = "Amount of nodes simulated", line=4)
#df <- ddply(proportions, .(nodes,process), summarise, msteps = mean(p))
#g <- ggplot(df, aes(x=nodes, y=msteps, group=process, colour=process)) + geom_line() +
#  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
#        panel.background = element_blank(), axis.line=element_line()) +
#  scale_fill_discrete(name="threads") +
#  xlab("Amount of nodes simulated") + ylab("Percentage of SR's containing 1,2 or >3 processes")
#g
#+end_src

#+name: sr-times
#+begin_src R :results output graphics :exports results :file fig/sr-times.pdf
sr_data <- readRDS(file="./logs/sr_counts/sr-times.Rda")
df <- ddply(sr_data, .(V3), summarize, mean_value = mean(V2))
# To plot every SR as its own dot instead of the per-count average, comment out
# the ddply above and use data=sr_data with y=V2 below.
ggplot(data=df, aes(x=V3, y=mean_value)) + xlim(0,4000) + ylim(0,0.02) +
  xlab("Number of processes computed in SR's") + ylab("Average time consumed (seconds)") +
  geom_point(size = 1) +
  theme(panel.grid.major=element_line(colour='grey'), panel.grid.minor=element_blank(),
        panel.background = element_blank(), axis.line=element_line(), legend.position="none")
#+end_src

#+name: busy
#+begin_src R :results output graphics :exports results :file fig/busy.pdf
orig_data = read.csv("./logs/busy_waiters/total_times_orig.csv", head=TRUE, sep=',')
opt_data = read.csv("./logs/busy_waiters/total_times_busy.csv", head=TRUE, sep=',')
# Speedups of Precise Mode
opt_data[, "baseline"] <- orig_data[, "p2"] / orig_data[, "p2"]
opt_data[, "2"] <- orig_data[, "p2"] / opt_data[, "p2"]
opt_data[, "4"] <- orig_data[, "p4"] / opt_data[, "p4"]
opt_data[, "8"] <- orig_data[, "p8"] / opt_data[, "p8"]
opt_data[, "16"] <- orig_data[, "p16"] / opt_data[, "p16"]
opt_data[, "24"] <- orig_data[, "p24"] / opt_data[, "p24"]
keep <- c("nodes", colnames(opt_data)[grep("^[1-9]", colnames(opt_data))], "baseline")
speedup_precise <- opt_data[keep]
df2 <- melt(speedup_precise, id = 'nodes', variable_name = 'threads')
g2 <- ggplot(df2, aes(x=nodes, y=value, group=threads, colour=threads)) + geom_line() +
  theme(axis.text.x = element_text(angle = -45, hjust = 0),
        panel.grid.major=element_line(colour='grey'), panel.grid.minor=element_blank(),
        panel.background = element_blank(), axis.line=element_line(), legend.position="right") +
  ylab("Speedup") + xlab("Amount of nodes simulated") +
  scale_x_continuous(breaks=df2$nodes)
g2
#+end_src

#+name: sr-par-threshold
#+begin_src R :results output graphics :exports results :file fig/sr-par-threshold_40000.png
precise <- readRDS(file="./logs/sr_counts/logs/precise_40000.Rda")
ggplot(data=precise, aes(x=V3.x, y=speedup)) + geom_point() + xlim(1,500) + ylim(0,2) +
  theme(panel.grid.major=element_line(colour='grey'), panel.grid.minor=element_blank(),
        panel.background = element_blank(), axis.line=element_line(), legend.position="none") +
  ylab("Speedup of parallel execution against sequential execution") +
  xlab("Amount of processes computed by each SR")
#+end_src

#+name: adapt-algorithm
#+begin_src R :results output graphics :exports results :file fig/adapt-algorithm.pdf
orig_data = read.csv("./logs/adaptive_algorithm/total_times_orig.csv")
opt_data = read.csv("./logs/adaptive_algorithm/total_times_adaptive.csv")
# Speedups of Precise Mode
opt_data[, "baseline"] <- orig_data[, "p2"] / orig_data[, "p2"]
opt_data[, "2"] <- orig_data[, "p2"] / opt_data[, "p2"]
opt_data[, "4"] <- orig_data[, "p4"] / opt_data[, "p4"]
opt_data[, "8"] <- orig_data[, "p8"] / opt_data[, "p8"]
opt_data[, "16"] <- orig_data[, "p16"] / opt_data[, "p16"]
opt_data[, "24"] <- orig_data[, "p24"] / opt_data[, "p24"]
keep <- c("nodes", colnames(opt_data)[grep("^[1-9]+", colnames(opt_data))], "baseline")
speedup_precise <- opt_data[keep]
df2 <- melt(speedup_precise, id = 'nodes', variable_name = 'threads')
g2 <- ggplot(df2, aes(x=nodes, y=value, group=threads, colour=threads)) + geom_line() + scale_fill_hue() +
  theme(axis.text.x = element_text(angle = -45, hjust = 0),
        panel.grid.major=element_line(colour='grey'), panel.grid.minor=element_blank(),
        panel.background = element_blank(), axis.line=element_line(), legend.position="right") +
  scale_y_continuous(breaks=c(1,2)) +
  ylab("Speedup-Precise mode") + xlab("Amount of nodes simulated")
g2
#+end_src

#+name: memory-consumption
#+begin_src R :results output graphics :exports results :file fig/memory-consumption.pdf
data = read.csv("./logs/timings/memory_consumption.csv", head=TRUE, sep=',')
df2 <- melt(data, id = 'nodes', variable_name = 'threads')
g2 <- ggplot(df2, aes(x=nodes, y=value, group=threads, colour=threads)) + geom_line() +
  theme(axis.text.x = element_text(angle = -45, hjust = 0),
        panel.grid.major=element_line(colour='grey'), panel.grid.minor=element_blank(),
        panel.background = element_blank(), axis.line=element_line(), legend.position="right") +
  xlab("Amount of nodes simulated") + ylab("Memory Consumption (GB)") +
  scale_fill_discrete(name="threads") +
  scale_color_manual(values=c('brown1','darkblue','darkorange2','cadetblue2','gold','hotpink4'),
                     labels = c("1","2","4","8","16","24"))
g2
#+end_src

#+name: real-elapsed-times
#+begin_src R :results output graphics :exports results :file fig/real-elapsed-times.pdf
data = read.csv("./logs/timings/total_times.csv", head=TRUE, sep=',')
df2 <- melt(data, id = 'nodes', variable_name = 'threads')
g2 <- ggplot(df2, aes(x=nodes, y=value, group=threads, colour=threads)) + geom_line() +
  theme(axis.text.x = element_text(angle = -45, hjust = 0),
        panel.grid.major=element_line(colour='grey'), panel.grid.minor=element_blank(),
        panel.background = element_blank(), axis.line=element_line(), legend.position="right") +
  xlab("Amount of nodes simulated") + ylab("Elapsed time of simulation (seconds)") +
  scale_fill_discrete(name="threads") +
  scale_color_manual(values=c('brown1','darkblue','darkorange2','cadetblue2','gold','hotpink4'),
                     labels = c("1","2","4","8","16","24"))
g2
#+end_src

#+name: binding
#+begin_src R :results output graphics :exports results :file fig/binding.pdf
orig_data = read.csv("./logs/binding_cores/total_times_orig.csv", head=TRUE, sep=',')
opt_data = read.csv("./logs/binding_cores/total_times_binding.csv", head=TRUE, sep=',')
# Speedups of Precise Mode
opt_data[, "baseline"] <- orig_data[, "p2"] / orig_data[, "p2"]
opt_data[, "2"] <- orig_data[, "p2"] / opt_data[, "p2"]
opt_data[, "4"] <- orig_data[, "p4"] / opt_data[, "p4"]
opt_data[, "8"] <- orig_data[, "p8"] / opt_data[, "p8"]
opt_data[, "16"] <- orig_data[, "p16"] / opt_data[, "p16"]
opt_data[, "24"] <- orig_data[, "p24"] / opt_data[, "p24"]
keep <- c("nodes", colnames(opt_data)[grep("^[1-9]", colnames(opt_data))], "baseline")
speedup_precise <- opt_data[keep]
df2 <- melt(speedup_precise, id = 'nodes', variable_name = 'threads')
g2 <- ggplot(df2, aes(x=nodes, y=value, group=threads, colour=threads)) + geom_line() + scale_fill_hue() +
  theme(axis.text.x = element_text(angle = -45, hjust = 0),
        panel.grid.major=element_line(colour='grey'), panel.grid.minor=element_blank(),
        panel.background = element_blank(), axis.line=element_line(), legend.position="right") +
  scale_y_continuous(breaks=c(1.0,1.5,2.0,2.5,3.0,4.0)) +
  ylab("Speedup") + xlab("Amount of nodes simulated")
g2
#+end_src

#+name: evol-threshold
#+begin_src R :results output graphics :exports results :file fig/evol-threshold.pdf
data = read.table("./logs/threshold/logs/thresh2_10000_threads4_precise.log", head=TRUE, sep=',')
ggplot(data, aes(x=row(data), y=X2)) + geom_line() +
  theme(axis.text.x = element_text(angle = -45, hjust = 0),
        panel.grid.major=element_line(colour='grey'), panel.grid.minor=element_blank(),
        panel.background = element_blank(), axis.line=element_line(), legend.position="right") +
  xlab("Id of Scheduling Round") + ylab("Value of parallel threshold")
#+end_src

#+name: all-enabled
#+begin_src R :results output graphics :exports results :file fig/all-enabled.pdf
orig_data = read.csv("./logs/all_optimized/total_times_orig.csv", head=TRUE, sep=',')
opt_data = read.csv("./logs/all_optimized/total_times_all.csv", head=TRUE, sep=',')
# Speedups of Precise Mode
opt_data[, "baseline"] <- orig_data[, "p1"] / orig_data[, "p1"]
opt_data[, "2"] <- orig_data[, "p1"] / opt_data[, "p2"]
#opt_data[, "4"] <- orig_data[, "p1"] / opt_data[, "p4"]
#opt_data[, "8"] <- orig_data[, "p1"] / opt_data[, "p8"]
#opt_data[, "16"] <- orig_data[, "p1"] / opt_data[, "p16"]
#opt_data[, "24"] <- orig_data[, "p1"] / opt_data[, "p24"]
keep <- c("nodes", colnames(opt_data)[grep("^[1-9]", colnames(opt_data))], "baseline")
speedup_precise <- opt_data[keep]
df2 <- melt(speedup_precise, id = 'nodes', variable_name = 'threads')
g2 <- ggplot(df2, aes(x=nodes, y=value, group=threads, colour=threads)) + geom_line() + scale_fill_hue() +
  theme(axis.text.x = element_text(angle = -45, hjust = 0),
        panel.grid.major=element_line(colour='grey'), panel.grid.minor=element_blank(),
        panel.background = element_blank(), axis.line=element_line(), legend.position="right") +
  ylab("Speedup") + xlab("Amount of nodes simulated")
g2
#+end_src

#+RESULTS: all-enabled
[[file:fig/all-enabled.pdf]]

* Emacs Setup :noexport:
This document has local variables in its postamble, which should allow Org-mode
to work seamlessly without any setup. If you are uncomfortable using such
variables, you can safely ignore them at startup. Exporting may require that
you copy them into your .emacs.

# Local Variables:
# eval: (org-babel-do-load-languages 'org-babel-load-languages '( (sh . t) (R . t) (perl . t) (ditaa . t) ))
# eval: (setq org-confirm-babel-evaluate nil)
# eval: (setq org-alphabetical-lists t)
# eval: (setq org-src-fontify-natively t)
# eval: (unless (boundp 'org-latex-classes) (setq org-latex-classes nil))
# eval: (add-to-list 'org-latex-classes
#   '("sigalt" "\\documentclass{sig-alternate}" ("\\section{%s}" . "\\section*{%s}") ("\\subsection{%s}" . "\\subsection*{%s}")))
# eval: (add-hook 'org-babel-after-execute-hook 'org-display-inline-images)
# eval: (add-hook 'org-mode-hook 'org-display-inline-images)
# eval: (add-hook 'org-mode-hook 'org-babel-result-hide-all)
# eval: (setq org-babel-default-header-args:R '((:session . "org-R")))
# eval: (setq org-export-babel-evaluate nil)
# eval: (setq org-latex-to-pdf-process '("pdflatex -interaction nonstopmode -output-directory %o %f ; bibtex `basename %f | sed 's/\.tex//'` ; pdflatex -interaction nonstopmode -output-directory %o %f ; pdflatex -interaction nonstopmode -output-directory %o %f"))
# eval: (setq ispell-local-dictionary "american")
# eval: (setq org-export-latex-table-caption-above nil)
# eval: (eval (flyspell-mode t))
# End: