OptorSim FAQ

The OptorSim Archive of Questions Asked
Caitriana Nicholson, March 2008

This is an edited archive of user questions submitted to the OptorSim mailing lists, with developers' and other users' responses. It is intended as a resource for other users, who may not receive ready responses from the original developers now that they have all moved on to other things. Questions are in plain font and answers are in italics. Some editing of grammar and spelling has been done, but not extensively, so don't blame the editor for those!

Contents:
Current State of the Project
Running OptorSim in Windows
Running OptorSim in MacOS
Configuration File Questions
Netbeans and OptorSim
Compilation Problems
Class File Documentation
Various: Initial replica placement, CEs and worker nodes, file pinning, access cost, job processing
Simulating Security Functions
Timing Model
Adding New Replication / Scheduling Strategies
Statistics Output
Resource Monitoring

State of the project

What is the current state of this simulator? Is it still being developed, and will there be any new versions?

The simulator is not being actively used by people within the EDG project (the project under which it was created). In
fact the EDG project finished a number of years ago. However, others are using and extending the codebase. The project is maintained in a repository at SourceForge (http://sourceforge.net/projects/optorsim) and new developers are welcome to join there, but the original developers are all working elsewhere now and no longer have time to make new releases. Any new questions should be addressed to the mailing list at optorsim-devel@lists.sourceforge.net, where they will be answered on a best-effort basis.

OptorSim in Windows

Windows Path Instructions for UserGuide

I am trying to learn about grid simulation tools, and am excited by OptorSim. However, I am stuck using a Windows XP system, and I would recommend adding (on page 4 of the OptorSim v2.0 Installation and User Guide) the following for Windows users: My Computer -> Properties -> Advanced -> Environment Variables, then highlight the Path in the System Variables box, click "Edit", and add to the end of the path:

%OptorSim-2.0 Directory%\bin

where %OptorSim-2.0 Directory% in my case was C:\optorsim-2.0

Running OptorSim in Windows

I can't find anybody who knows how to run OptorSim in Windows. I am not familiar with the Unix environment. Can you tell me how to run OptorSim using Windows? The user guide, I think, focuses more on Unix.

Running OptorSim in Windows is pretty much the same as running in Unix. In the optorsim-2.0\bin directory there is a Windows executable called OptorSim.bat. Start up a command prompt, go into the optorsim-2.0 directory and run bin\OptorSim.bat. Edit the examples\parameters.conf file to set the parameters you want. There are instructions for
running in Windows in the user guide, on pages 4 and 5 - all other instructions are the same as for Unix.

How to execute OptorSim Simulator in Windows OS?

I downloaded the OptorSim simulator, but it is not working. I am running this simulator under Windows. Whenever I use OptorSim.bat, the following error appears:

Exception in thread "main" java.lang.NoClassDefFoundError: org/edg/data/replication/optorsim/OptorSimMain

If you are using the OptorSim 2.0 downloaded from the website, and installed it according to the instructions in the user guide, it should work. As Paul said, the classpath set in the OptorSim.bat file assumes you are running from within the optorsim-2.0 directory; if you want to run it from a different directory, please modify the paths in the file so that it can find lib/edg-optorsim.jar, etc., from wherever it is running.

OptorSim with MacOS

I would like to find out if the simulation tool "OptorSim" can be used on a Macintosh operating system.

In principle, OptorSim can be used on any system that has Java. Getting OptorSim working for Macs involved getting Java working. You will need two parts: the build environment (Java compiler and the build tool "ant") and the run-time environment (JRE). The web pages http://developer.apple.com/java/ and http://www.pepsan.com/javamac/ seem to be good places to start. A few wrapper scripts are included with OptorSim (in the directory "bin"). These will probably not work for Macs, but it should be fairly easy to develop Mac equivalents.

Configuration Files

CMS testbed topology
I want to evaluate our strategy with a promising Grid topology like the "Grid topology for CMS world wide data production challenge in spring 2002" introduced in the paper "Evaluation of an Economy-Based File Replication Strategy for a DataGrid". As a fallback, I could configure the topology myself, which would be undesirable from my point of view. Can you help me to obtain the configuration files for the "Grid topology for CMS world wide data production challenge in spring 2002"?

The CMS testbed configuration files are included in the examples/ directory of OptorSim:

cms_testbed_grid.conf
cms_testbed_jobs.conf
cms_testbed_bandwidths.conf

Job probabilities

Hi, I'm trying to understand the extra examples that are on the web, and I don't know how the percentages are calculated. Do you know where the percentages come from?

If I understand your question correctly, you're asking about the following part of the configuration file:

\begin{cescheduletable}
0 jpsijob 0.17 highptlepjob 0.34 incelecjob 0.5 incmuonjob 0.67 highptphotjob 0.84 zbbbarjob 1.0
3 jpsijob 0.14 highptlepjob 0.44 incelecjob 0.58 incmuonjob 0.72 highptphotjob 0.86 zbbbarjob 1.0
7 jpsijob 0.17 highptlepjob 0.34 incelecjob 0.5 incmuonjob 0.67 highptphotjob 0.84 zbbbarjob 1.0
#8 jpsijob 0.29 highptlepjob 0.43 incelecjob 0.57 incmuonjob 0.71 highptphotjob 0.86 zbbbarjob 1.0
11 jpsijob 0.17 highptlepjob 0.34 incelecjob 0.5 incmuonjob 0.67 highptphotjob 0.84 zbbbarjob 1.0
12 jpsijob 0.14 highptlepjob 0.28 incelecjob 0.58 incmuonjob 0.72 highptphotjob 0.86 zbbbarjob 1.0
13 jpsijob 0.14 highptlepjob 0.28 incelecjob 0.42 incmuonjob 0.72 highptphotjob 0.86 zbbbarjob 1.0
14 jpsijob 0.14 highptlepjob 0.28 incelecjob 0.42 incmuonjob 0.56 highptphotjob 0.7 zbbbarjob 1.0
15 jpsijob 0.3 highptlepjob 0.44 incelecjob 0.58 incmuonjob 0.72 highptphotjob 0.86 zbbbarjob 1.0
16 jpsijob 0.17 highptlepjob 0.34 incelecjob 0.5 incmuonjob 0.67 highptphotjob 0.84 zbbbarjob 1.0
17 jpsijob 0.14 highptlepjob 0.28 incelecjob 0.42 incmuonjob 0.56 highptphotjob 0.86 zbbbarjob 1.0
\end

Percentages are cumulative. For example, the meaning of the row

0 jpsijob 0.17 highptlepjob 0.34 incelecjob 0.5 incmuonjob 0.67 highptphotjob 0.84 zbbbarjob 1.0

is that Computing Element 0 can run:

jpsijob with probability 0.17
highptlepjob with probability 0.34 - 0.17 = 0.17
incelecjob with probability 0.5 - 0.34 = 0.16
incmuonjob with probability 0.67 - 0.5 = 0.17
highptphotjob with probability 0.84 - 0.67 = 0.17
zbbbarjob with probability 1.0 - 0.84 = 0.16

Job running

I have one doubt about OptorSim. In the job config file I defined some ten jobs. In the parameter config file I declared number.jobs = 100. How will it run the jobs - 10 times for each job? What is the relation with the job selection probability?

It will run 100 jobs. The jobs it chooses will depend on the job selection probability which you define in the job configuration file. If you have given all your 10 jobs the same probability, it will run each job 10 times (on average - it will not be exact). If they have different probabilities, they will run a different number of times. For example, suppose you have the following in your job configuration file:

\begin{jobselectionprobability}
jobA 0.5
jobB 0.25
jobC 0.15
[...]
jobJ 0.05
\end{jobselectionprobability}

Then, for 100 jobs, jobA would run about 50 times, jobB about 25 times, jobC about 15 times, and so on.

So each job will run a number of times based on its selection probability. My question is whether all jobs will execute sequentially. That is, if job1 has to run 10 times, will the next job (job2) run only after job1 has run 10 times? Is it like that?

No, it is not like that. The jobs are chosen "at random", but weighted by their selection probability. A random number between 0 and 1 is generated. This is then compared to the job probability; for example, if job1 has probability 0.5, then if the random number is less than 0.5 job1 is chosen to run, and if it is bigger than 0.5 job1 is not chosen and another job is considered. You can see the code for it in the randomJob() method of GridContainer.

How do I calculate the number of files required for a particular job, and the file names? Is there any function for this implemented in OptorSim? What is needed for getBestFile()? If the initial file distribution covers more than one site, will only that be used? How is it related to access cost?

These are set in the job configuration file. In the jobtable, you define the set of files for a particular job. Then in the filesetfraction table, you define the fraction of the total fileset which one job needs. So if you have 100 files defined in a fileset for job type jobA, and have jobA filesetfraction set at 0.25, each individual instance of jobA will process 25 files.

getBestFile() takes an array of lfns (logical file names) and an array of the corresponding file fractions. Then for each file in the array, it tries to replicate it according to the chosen replication strategy. Each Optimiser class therefore has
its own implementation of getBestFile().

If the initial file distribution is for only one site, then clearly only that site will be used for the first replications. If files are on more than one site, all the sites will be considered as sources of replicas. The site that gives the lowest access cost (or wins the auction in the economic model, or whatever your optimiser does) is chosen to replicate from.

kSI2000

What is the meaning of kSI(2000)?

SI2000 (or CINT2000; the "k" is just the kilo prefix, which is easier to use) is a standard way of measuring CPU performance for different machines. See http://www.spec.org/cpu/ for more information, e.g. results for different machines at http://www.spec.org/cpu2000/results/res2007q1/ It's the measure that the LCG project uses to calculate its resource requirements.

CMS testbed grid

In cms_testbed_grid the number of sites mentioned is 27, but while running it shows only 19 sites. Why? Similarly, the initial file distribution is at site 14. The site bandwidth in the grid conf file shows that site 14 is connected to site 15 and there is no other connection. How, then, are the files transferred to other sites for job processing?

Some of the sites are router sites, so they have no SE or CE - they just transfer the files through them - and do not appear in the simulation output.

For cms_testbed_grid, site 14 is actually connected to both site 15 (Lyon) and site 23 (a router). Even if a site has only one connection, files can be transferred to other sites as long as they are all connected in the network. Files can go *through* other sites on the way between the source and destination sites.
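To make the point about files travelling *through* intermediate sites concrete, here is a minimal Java sketch of a multi-hop reachability check. It is not OptorSim code: the class SiteReachability and its methods are hypothetical, and only the links 14-15 and 14-23 come from the answer above. The link from router 23 to site 0 is invented purely for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a hypothetical helper, not part of the OptorSim code base. */
final class SiteReachability {
    // Undirected adjacency list: site id -> directly linked site ids.
    private final Map<Integer, List<Integer>> links = new HashMap<>();

    /** Adds a bidirectional network link between two sites. */
    void addLink(int a, int b) {
        links.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
        links.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
    }

    /** Breadth-first search: true if a file at 'from' can reach 'to' over any number of hops. */
    boolean canReach(int from, int to) {
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> queue = new ArrayDeque<>();
        seen.add(from);
        queue.add(from);
        while (!queue.isEmpty()) {
            int site = queue.remove();
            if (site == to) {
                return true;
            }
            for (int next : links.getOrDefault(site, List.of())) {
                if (seen.add(next)) { // add() returns false for already visited sites
                    queue.add(next);
                }
            }
        }
        return false;
    }

    /** Builds the toy topology used in this example. */
    static SiteReachability example() {
        SiteReachability grid = new SiteReachability();
        grid.addLink(14, 15); // stated in the FAQ answer
        grid.addLink(14, 23); // stated in the FAQ answer: site 23 is a router
        grid.addLink(23, 0);  // hypothetical link, for illustration only
        return grid;
    }

    public static void main(String[] args) {
        SiteReachability grid = example();
        System.out.println("14 -> 0 reachable: " + grid.canReach(14, 0));
        System.out.println("14 -> 99 reachable: " + grid.canReach(14, 99));
    }
}
```

In this toy network, site 14 has no direct link to site 0, yet a file can still get there by passing through router 23. A site with no path at all (site 99 here) is simply unreachable.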
Netbeans and OptorSim

I am trying to use Netbeans 4.0 (based on Apache Ant) for compiling modified code. Has anybody tried that? If yes, could you tell me if any changes are needed? It seems that it does not see some of the packages and complains about some packages, saying "package does not exist" at the import statements. It also shows a warning "deprecation: show() in java.awt.window has been deprecated" for this.show().

I think I remember someone having this problem before, and if I remember correctly, the solution was to put all the jar files for the external packages together with the optorsim jar file in the lib/ directory. Or else all the jar files had to be in the Netbeans working directory, or in the top optorsim directory... it was certainly something to do with not finding the correct paths to all the libraries. Try playing around with the location of the jar files and see if it works.

Compilation Problems

Problem with Compiling OptorSimV2

I have a problem with compiling the files with the ant command. I didn't add any code of my own; the problem just pops up when I install and run it using ant.

Buildfile: build.xml
init:
prepare:
build:
[javac] Compiling 95 source files to /home/rony/optorsim-2.0/build/classes
[javac] /home/rony/optorsim-2.0/src/org/edg/data/replication/optorsim/OptorSimGUI.java:68: error: Type `JTextArea' not found in declaration of field `tabTerm'.
[javac] private static JTextArea tabTerm;
[javac] ^
[javac] /home/rony/optorsim-2.0/src/org/edg/data/replication/optorsim/OptorSimGUI.java:892: error: Type `JPEGImageEncoder' not found in the declaration of the local variable `encoder'.
[javac] JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(out);
[javac] ^
[javac] /home/rony/optorsim-2.0/src/org/edg/data/replication/optorsim/OptorSimGUI.java:912: error: Type `JPEGImageEncoder' not found in the declaration of the local variable `encoder'.
[javac] JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(out);
[javac] ^
[javac] /home/rony/optorsim-2.0/src/org/edg/data/replication/optorsim/OptorSimGUI.java:931: error: Type `JPEGImageEncoder' not found in the declaration of the local variable `encoder'.
[javac] JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(out);
[javac] ^
[javac] /home/rony/optorsim-2.0/src/org/edg/data/replication/optorsim/OptorSimGUI.java:951: error: Type `JPEGImageEncoder' not found in the declaration of the local variable `encoder'.
[javac] JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(out);
[javac] ^
[javac] /home/rony/optorsim-2.0/src/org/edg/data/replication/optorsim/OptorSimGUI.java:1101: error: Type `JPEGImageEncoder' not found in the declaration of the local variable `encoder'.
[javac] JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(out);
[javac] ^
[javac] 6 errors
BUILD FAILED
file:/home/rony/optorsim-2.0/build.xml:33: Compile failed; see the compiler error output for details.

Both JTextArea (part of Swing) and JPEGImageEncoder (part of the com.sun.image.codec.jpeg package) are included in standard Java REs (or SDKs) these days. For some reason, it looks like your version of Java doesn't have access to them. Which version of Java are you using?

A problem on building OptorSim with ant!

Several times I tried to re-build OptorSim with ant as you have described in the user guide, but the build fails with the following error message:
Build failed
G:\optorsim-2.1\build.xml:74: execute failed: java.io.IOException:createprocess:bin\optorsimTests.sh error=193

It looks like you are running on a Windows machine, is that correct? From the section of output you sent, it also looks like you were trying to run the functional test suite, i.e. the command:

ant func-test

I think this may be the problem - the functional tests, which this command runs, only work in a UNIX operating system. Everything else should be fine on Windows, but we only wrote the functional tests for UNIX, as page 4 of the user guide mentions. Building the source for OptorSim itself, just by running 'ant', should work.

Documentation - Class explanations

I want to use OptorSim for my PhD thesis simulation. How can I find an explanation of the code for each class? I need to understand how each class works before I make any change.

After installing OptorSim, you can get the JavaDoc explanation of each class by doing:

ant doc

which will generate the documentation in html format in the doc/api directory of your optorsim installation directory. Page 4 of the User Guide also outlines this procedure. You can then read these html files with your web browser. This would be the best way for you to get an idea of how the different parts of the simulation work. If you want to see more details of the code itself, please go into the src/ directory and open up the source files to read them. The level of commenting in the code is somewhat variable, however.

Initial Replica Placement

I have also seen that in OptorSim the initial file and replica placement is made randomly using a uniform distribution, and I want to know if I can change this by implementing my own initial placement strategy.

To do this, you would have to change the assignFilesToSites() method in the class JobConfFileReader. You could just
extend this class and override the method in the subclass.

CEs and Worker Nodes

I am not quite clear about the number of worker nodes and computing elements. The user manual says that there is a maximum of 1 CE per site. In the code of GridSite you have a vector for the CEs at the site (Vector _computingElementCollection = new Vector();). Is this just for future work?

It was intended to further develop the model to allow more than 1 CE per site, so the CEs at each site were implemented as a Vector, but extending this to actually having >1 CE/site was never implemented, so yes, it is 'future work'.

I am also not clear about the function returning the total number of computing elements in the GridContainer class. Is this the total number in the whole Grid system?

Yes, it is the total number of CEs in the whole grid.

What is the need for worker nodes in OptorSim? Only one job at a time can be processed by a CE. How, then, are worker nodes involved in job processing?

The time to process the job is divided by the number of worker nodes, so if there are more worker nodes in a CE it processes the job faster. This is a very simple model - Antoine Vernois from Lyon developed a more sophisticated model but I don't think it's included in the release.

Access cost

I am trying to compute the cost of accessing a file if it is stored at a certain storage element or site. This is in terms of the bandwidth, not the number of hops.

The access cost is currently calculated as

(file size) / (available bandwidth)
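As a worked illustration of the formula above, here is a minimal Java sketch. The class AccessCostSketch and the method accessCost are hypothetical names, and the units (megabits and megabits per second, giving a cost in seconds) are an assumption chosen only to make the arithmetic easy to follow.

```java
/** Illustrative only: a hypothetical helper, not the OptorSim implementation. */
final class AccessCostSketch {

    /**
     * Access cost = (file size) / (available bandwidth).
     * Assumed units: megabits and megabits per second, so the result is in seconds.
     */
    static double accessCost(double fileSizeMbit, double availableBandwidthMbitPerSec) {
        if (availableBandwidthMbitPerSec <= 0) {
            // No usable bandwidth: treat the file as infinitely expensive to fetch.
            return Double.POSITIVE_INFINITY;
        }
        return fileSizeMbit / availableBandwidthMbitPerSec;
    }

    public static void main(String[] args) {
        // The same 8000 Mbit file over a slow and a fast link.
        System.out.println("100 Mbit/s link:  " + accessCost(8000.0, 100.0) + " s");
        System.out.println("1000 Mbit/s link: " + accessCost(8000.0, 1000.0) + " s");
    }
}
```

With these assumed numbers the slower link costs 80 seconds and the faster one 8 seconds, so a replica behind the faster link would be preferred. The sketch only restates the formula; it does not reproduce OptorSim's own access cost calculation.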
The access cost calculation is in the NetworkClient class, so if you want to modify it that is where you should make the changes.

Also, is there any function that returns the best route, that is, how can I figure out the maximum bandwidth available on the route? I hope the question is clear, as I wrote it in a hurry.

The best route is calculated at the beginning of the simulation using a Dijkstra search algorithm - see the GridConfFileReader and GridContainer classes for details. For each pair of grid sites, the best route between them, based on the maximum bandwidth, is calculated.

File Pinning

What is the meaning of "pinned" (the pin status of the file)?

If a file is pinned, it can't be deleted from the SE until it is unpinned. This is so that if an Optimiser decides to replicate a particular file, it can prevent it from being deleted until the replication is finished.

RB job processing

Users submit jobs to the RB, and the RB submits jobs to the CEs based upon the scheduling algorithm. OptorSim starts processing jobs only after all the jobs have been submitted. Why?

Actually, the RB starts processing jobs as soon as the Users have started submitting them, so there can still be users submitting jobs while the RB is processing them. If there is not a large number of jobs and it goes quickly, however, it might *look* like the RB has not started until all the jobs have been submitted.

Simulating security functions

Can we simulate grid security functions by using OptorSim? The description of OptorSim at the DataGRID website only describes data access optimization algorithm simulations. I wonder if it can be used for simulating the security features.

Could you explain what kind of grid security functions you want to simulate? What level of detail are you looking at? It
is currently possible to simulate different site policies (of which jobs to accept) using OptorSim, but investigating security in a more detailed way would require extension to the code. As you say, it is designed for looking at data replication algorithms, so the implementation of things like networking and security is quite high-level, though you could of course modify the code to your own requirements if you wish.

We are currently working on pluggable security services. Initially, we are using the set of services defined in the OGSA document. The idea behind this effort is to enable VO members to invoke the set of security services that adapts to their requirements (rather than a 'standard' set of security services). To avoid any mismatch in the set of services invoked by the various members of the same VO, we are working on conflict-management paradigms and will require some mechanism to adequately simulate our propositions. Besides this aspect, we need to carry out a number of other simulations such as scalability, real-time invocation of these services, invocation by users as well as by services, ... It is evident that our simulation requirements are quite different from the main intended use of OptorSim - I don't know how many modifications to its code are required to suit us!

It looks like it would require substantial code modification to enable OptorSim to match your requirements. I would suggest looking for some lower-level simulators, although you can download the code if you want to examine it more closely.

Timing Model

OptorSim v. SimGrid

I am working on replication and caching optimization algorithms. I would like to know if OptorSim would be fast
enough. Some people claim that since it is written in Java it is not going to be as good as SimGrid (written in C). Can you please comment and advise me if OptorSim will be efficient enough? Secondly, if my idea involves adding other components to the simulator, is this doable; i.e., can I add my own work and use it as part of the simulator?

[Antoine Vernois] To my mind, it's true that OptorSim is not as fast as SimGrid, but it's not due to the language. It's due to the fact that SimGrid is event-driven, i.e. time is advanced in calculated steps, while OptorSim is a kind of real-time simulator. So, for example, simulating 10 hours of grid activity may take only 1 hour with SimGrid (but it depends on what you simulate; it can also take 10 minutes, or 4 hours or more), while it will take 10 hours with OptorSim. Fortunately there is a scale factor in OptorSim that allows you to speed up the time (by dividing all sleep times by this factor).

[Editor's Note: The above was true for OptorSim 1.0; version 2.0 and above use a more event-driven model and no longer run in real time (although the option to do so is still there).]

But I think that the choice of SimGrid or OptorSim should not be made on their execution performance, but on the tools they offer you. For example, I chose to use OptorSim (while the main part of SimGrid is developed in my lab :-) because it includes all the mechanisms to manage data and their replication. It gives me routines to locate, retrieve and delete data and gives me quite a good estimation of access time. Moreover the global architecture (following the EDG architecture) is already implemented and is fine for my needs.

In SimGrid, you have all the tools to do that, but you have to do it yourself! You will have to implement the replica manager, storage element and so on...

Another point to look at is the way the bandwidth sharing is simulated. In OptorSim the model is quite simple but efficient
enough if you manage lots of transfers. But I think that improving this point is on the OptorSim developers' to-do list.

So, the choice of OptorSim or SimGrid mainly depends on your own needs. As a user of OptorSim with particular requirements, I added lots of things to the OptorSim core to match my needs. It's quite easy, as the code is well commented and quite easy to understand.

[David Cameron] As you say Antoine, the difference in speed between the two simulations is the time model used, not the language. I think the main consideration for which simulation to use should be how easy it is to adapt for your particular purpose. In my experience anyway, most of the time is spent developing and testing simulation code rather than getting results. Since you say you are interested in replication caching strategies, I think OptorSim already has all the features necessary for this and should be easy to expand and implement your own algorithms.

Of course since I am one of the developers of OptorSim maybe I am biased towards it ;) but I think it would involve more work on your part since OptorSim was designed to test replication strategies. As for adding your own code, the license means you are free to do what you like to the code as long as you acknowledge the original authors and keep the copyright headers in each java file.

Timing bug?

While using the simulator, I have encountered many strange bugs. For example, you claim that the time measured by OptorSim does not depend on the computer on which it runs (the time you use does not depend on system time). However, by doing several simulations, I found out that the times differ significantly (sometimes by an order of magnitude)
when run on different machines. I double checked the input parameters and the "time.advance" parameter is set to "yes". Is this a known problem, or am I interpreting this parameter wrongly?

For the time.advance, you are right: it should be independent of the underlying CPU speed. Essentially, time (within the simulation) is frozen whilst anything is being calculated. Once all simulation work is finished there is a step-wise jump in time to the next time that something would "happen". This should be independent of actual CPU effort, although the time taken to simulate any given grid configuration will obviously depend on the machine's CPU.

Just to be sure the parser has picked up the time.advance option, could you check whether the CPU utilisation stays high (over 90%) for the duration of the simulation? If it doesn't, then it's using "linear" time and the discrepancy would be expected. If the CPU utilisation does stay high, then there's a bug somewhere.

Adding new replication or scheduling strategies

New replication strategy

My work is about initial replica placement in the Grid. In fact, my goal is to place initially in the grid (when the file f is created) a number R of replicas of the file f to improve fault tolerance. Unfortunately, our faculty doesn't have the necessary infrastructure for my experiments. I am very interested in your simulator OptorSim. I have read the user guide and I have some questions. Is it possible for me to add new replication strategies?

Yes; by extending the Optimiser classes appropriately, you can add your own replication strategy fairly easily. You would probably also have to extend the StorageElement classes according to your strategy.

New scheduling algorithm
I am proposing a new scheduling and replication algorithm. I am using OptorSim 2.1 for my project. How do I add my new scheduling algorithm to OptorSim?

OK, so you mention two things: a scheduling algorithm and a replication algorithm. OptorSim's main focus is on replica optimisation, so that's in a more advanced state; so, if I were you, I would start with that.

In the /src/org/edg/data/replication/optorsim/optor directory you should see the available optimisation algorithms. The key thing is that they all implement the Optimisable interface: this is how the rest of the software interacts with the replica optimisation strategy.

The replica optimisation algorithms already in OptorSim form a strong class hierarchy. At the bottom is a skeleton class (SkelOptor) that implements some very basic functionality. All the others extend either this class or some other (abstract) class.

To create your own replica optimisation algorithm, you must create a new class in this directory. Your new class can either implement the Optimisable interface directly or extend one of the existing classes; it just depends on how your algorithm is going to work.

For job scheduling, have a look at the ResourceBrokerFactory class. This is a singleton class that is used to return a singleton object, an object that implements the ResourceBroker interface. To begin with, you probably want to write a new class that extends the skeleton implementation (SkelResourceBroker). Have a look at RandomCEResourceBroker to get the idea.

Thanks for your suggestion. Following it, I am going to start implementing my replication strategy (Best Client). I have some doubts about that, which I have listed below.
1. In the job config file (cms_testbed_jobs.conf), the third row is the schedule table. It contains the sites and the jobs they are willing to run. My question is: are those the only sites which will run jobs while the program executes?
Yes.

And the sites which are not specified will be idle at that time. Is that so?

Yes, indeed. This is a Grid paradigm. Although Grid computing is about providing access to a large computing facility, a key aspect is that each site can choose which "virtual organisation" (VO) they wish to support. You can see this with live WLCG data here: http://gridmap.cern.ch/gm/ Click on the different VOs and notice that some sites turn white, indicating they don't support that VO (all sites support the OPS VO, though).

When someone submits a job (or, more likely, a series of jobs) to the Grid, they identify themselves as being a member of a VO (in the HEP world ATLAS, CMS and LHCb are examples of VOs). This user can run their jobs, access and store data (etc.) because of their membership of that VO.

Likewise, each site can choose which VOs they wish to support and how much they want to support them. So, a site may choose to dedicate itself to a particular VO, whereas other sites may strongly favour one VO but allow work from many other VOs.

OptorSim attempts to simulate this effect. The cescheduletable describes which jobs a site is willing to run. In reality, a site does not decide job-by-job (instead, the decision is based on a number of factors, the most prominent being the VO membership of the person submitting the job), but this should allow us to simulate a real job-submission pattern.

And, yes, this might result in idle CPUs. But it's important to allow the sites their autonomy... and in practice, this is
unlikely to be a problem. The computer hardware is bought to match predicted demand, so it's fairly unlikely that computers will be idle.

2. Do I have to create the config files first and then start coding?

No, you should be able to use the existing files. The "simple" ones are a good place to start: simple_grid.conf and simple_job.conf. Just update the parameters.conf file. You will need to add support for a new scheduler (i.e. number 5).

3. Shall I use the existing config files for my proposed algorithm or do I have to create my own?

Whichever you feel more comfortable with; either will work. Personally, I would copy the existing one and edit it. That way you have a local copy that hasn't been altered, allowing you to see what you've changed.

4. When developing my own scheduling algorithm in OptorSim, do I have to change any existing code in OptorSim, or can I just inherit from the classes?

Again, that depends on how you are going to implement your new class. Fundamentally, as long as your class implements ResourceBroker (with correct semantics) it'll work.

However, there's a pretty useful skeleton class, SkelResourceBroker, that does much of the boring work, especially with thread priorities. I would recommend writing the new RB so that it extends SkelResourceBroker. You then need only implement the findCE() method.

5. I am going to implement the best client replication strategy; the best client is the site from which the most requests come. I have to replicate the files to that site. From where, and how, can I get the required data for that?

You haven't said how you plan to decide *when* to replicate a file: you'll need to limit this somehow, otherwise a site can (potentially) attempt to transfer so many files that none of the transfers will proceed. Also, the simulator was
designed so the SEs pulled files they wanted, rather than an external agent pushing them. You can, of course, implement a coordinating agent that decides which files to replicate.

In general, this is one of the problems with distributed computing: how to collect information from many sites without introducing a single point of failure or performance bottleneck. In the simulator it is easy to cheat: one has (potentially) complete access to any object's information and the cost of accessing this information is small (it's all within the computer's memory). You could record the access patterns locally (on the CE, for example) and provide a method for accessing those values.

However, in real life, it becomes more complex. There is a very large number of files being stored, with jobs requesting them in complex patterns. One cannot record all file requests centrally as they happen far too frequently. Registering the files would become a bottleneck and single point of failure.

The solution used in the simulator is to hold an auction for each file request over a peer-to-peer (P2P) network. Sites can choose to participate in an auction (they do by default, but they can time out without affecting the process). The auction has two purposes: it selects the best available copy of a file and it also allows "nearby" sites to know that a particular file was requested. The second purpose allows a decentralised knowledge of file requests without imposing a heavy-weight solution (such as registering each file request).

The site itself will initiate the transfer, so it can (potentially) know the access patterns and how "hot" any particular file that isn't held locally is. You will need to figure out how it can determine whether file X is "sufficiently hot" that it is worth replicating it to the local storage (the site's SE).
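One simple way to decide whether a file is "sufficiently hot" is to count requests locally and act once when a threshold is reached. The Java sketch below illustrates that idea only; HotFileTracker, recordRequest and requestCount are hypothetical names that do not exist in OptorSim, and a fixed threshold is just one possible assumption about what "hot" means.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a hypothetical per-site request counter, not OptorSim code. */
final class HotFileTracker {
    // Number of requests seen so far for each logical file name (lfn).
    private final Map<String, Integer> requestCounts = new HashMap<>();
    private final int threshold;

    HotFileTracker(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.threshold = threshold;
    }

    /**
     * Records one request for the given file.
     * Returns true only on the request that makes the count reach the threshold,
     * so a caller would trigger at most one replication per file.
     */
    boolean recordRequest(String lfn) {
        int count = requestCounts.merge(lfn, 1, Integer::sum);
        return count == threshold;
    }

    /** Returns how many requests have been recorded for the given file. */
    int requestCount(String lfn) {
        return requestCounts.getOrDefault(lfn, 0);
    }

    public static void main(String[] args) {
        HotFileTracker tracker = new HotFileTracker(3);
        for (int i = 1; i <= 4; i++) {
            boolean replicateNow = tracker.recordRequest("lfn1");
            System.out.println("request " + i + ": replicate now? " + replicateNow);
        }
    }
}
```

Returning true only once per file is a deliberate choice here: it addresses the concern raised above that a site should not attempt so many transfers that none of them proceed. A real strategy would probably also need to age or reset the counts, which this sketch leaves out.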
I am expecting your reply to the following three questions:

(i) I kindly request you to explain the working of the two scheduling algorithms (Access Cost and Queue Access Cost), i.e. how the execution happens in OptorSim.

You will find the implementation of the Access Cost algorithm in the AccessCostResourceBroker class, and Queue Access Cost in the CombinedCostResourceBroker class. Both of them extend SkelResourceBroker with a different findCE method. So for your algorithm, you should write a new class (e.g. QueueExecutionTimeResourceBroker) which also extends SkelResourceBroker with a different findCE method.

AccessCostResourceBroker, when it is given a job to allocate, iterates through all the available CEs. First it checks whether the CE will accept that job and whether it has space in the job handler queue. If these are OK, it then calls the getAccessCost method of the optimiser. This calculates the cost of accessing the files, depending on the optimiser method selected (e.g. LFU, LRU, economic model). The CE which has the lowest access cost for the job is then selected and the job sent there.

CombinedCostResourceBroker works in a similar way, but as well as calculating the access cost for the job in question, it also accesses the job handler of the CE it is looking at and gets the access cost of each job in the queue (you can see this in the getQueueAccessCost method of the JobHandler class). It combines the two costs, and the CE with the lowest total cost is chosen.

(iii) The scheduling algorithm which I am trying to implement is Queue Execution Time. The explanation of that algorithm is as follows.

Queue Execution Time = execution time of current job + all the jobs in the queue, where
Execution Time of current job = access cost of remote files + execution time of all the files.

We are calculating our algorithm, Queue Execution Time, in a similar way to Queue Access Cost. In Queue Access Cost, the access cost of the current job and all the jobs in the queue is calculated. We additionally calculate the execution time of the files needed to run the job, together with the access cost. My question is: where do I have to start to implement this algorithm?

I think you should start, as I mentioned above, by writing a new extension of SkelResourceBroker. You could copy most of what is in CombinedCostResourceBroker, and simply add the execution times to the total cost for each CE. If you are adding the execution time for all jobs in the queue, you would also need to modify the getQueueAccessCost method in JobHandler, or add a new method such as getQueueExecutionCost, which would add the execution time for each job as well as the access time. So, in fact, I think it is not too difficult for you to implement your algorithm.

Will a new optimisation strategy affect schedulers?

All the scheduling algorithms are based on the optimisation strategy. I am also going to implement a new optimisation strategy, BestClient: only if the number of requests for a file increases and reaches the threshold value will the file be replicated. Where do I have to start to implement this algorithm, and will it affect the existing access cost function?

The scheduling algorithms work independently of the file replication algorithms, so your new strategy shouldn't affect the existing scheduling algorithms. You will need to write a new Optimiser class, e.g. BestClientOptimiser, which extends ReplicatingOptimiser with a new getBestFile() method. If you don't put a getAccessCost method in it, it will use
the existing getAccessCost method from SkelOptor so it won't be affected. You will also need a corresponding StorageElement class, e.g. BestClientStorageElement, which defines which file(s) to delete when the SE is full. Have a look at the existing Optimiser and StorageElement classes to see how it works.

Neighbour gridsites information

I have a problem in OptorSim programming. The problem is how to get the data files stored in the SEs of the neighbouring gridsites.

Can you explain your problem in some more detail? If sites choose not to replicate a data file to their own SE, they can read it remotely from another site using the simulateRemoteIO method in the SimpleComputingElement class. Do you mean this, or do you mean replicating the data file from another site to its own SE? Perhaps I can give a better explanation if you can tell us some more about what you want to do.

Suppose that many jobs are submitted to each gridsite. The jobs at a given gridsite will then read data files either from its local SEs or replicated from a remote gridsite. The neighbouring gridsites may be a good choice for replicating data files. Therefore, the question is how to get, at run-time, the logical names of the data files stored in SEs at the neighbouring gridsites.

Let me see if I understand you correctly. You want to write some code for OptorSim which will look at the neighbouring grid sites and get the list of LFNs (logical filenames) of files in these sites. First, to get the list of neighbouring sites for a particular site, you can use the method neighbouringSites() in the class GridSite. Second, to get the files at these sites, you will have to use GridSite.getSE() to get the SE at each site, and
StorageElement.listFiles() to get the list of files in a human-readable format. You can also use StorageElement.getAllFiles() to return them in the form of a hashtable.

Yes, that's my purpose. I tried these methods.
When I use neighbouringSites(), GridSite.getSE() and StorageElement.listFiles(), it seems I can get the neighbouring GridSite and its SEs, but no LFNs. So I tried ReplicaManager.listReplicas() to get the names of all the replicas. This method works OK, but it cannot tell me which gridsite a replica belongs to.

Statistics Output

Reading Output

Is there any other tool or software for evaluating the algorithm? I have to evaluate my scheduling algorithm against the existing ones implemented in OptorSim with the help of charts. How do I do that?

We normally used the statistics output which OptorSim gives at the end, writing all the OptorSim output to a file and then using some scripts to extract the information we were interested in. The plots were then drawn using some separate software, inputting the data 'by hand'. If you have the statistics level in the parameters file set to 3, it gives the maximum information. For example, if you are interested in evaluating your algorithm by comparing the job times with other algorithms, you can collect the totalJobTime information for all the sites from the statistics output and get the mean to give you the 'mean job time' variable that we used. If there is some information you need which is not output, you can modify the getStatistics
method of the various grid elements (SE, CE, GridSite, GridContainer) to output what you want. I've attached the script we used to extract the mean job time, to give you some idea, but you can probably come up with a solution yourself to meet your own needs! Otherwise, you can use the GUI (see the section "Using the Graphical User Interface" in the user guide), but this is not so useful if you are running a lot of experiments in batch mode.

CE usage

How do I calculate the CE usage for the entire grid in OptorSim?

When OptorSim finishes running and the Statistics tree is printed out at the end, you will see that the first element listed there is the GridContainer. This is the whole grid. You will see there is an item there called ceUsage, which is what I think you are looking for. It looks like this:

    ResourceBroker> all jobs finished, shutting down P2P network ...
    Statistics for the GridContainer taken Fri Sep 09 12:21:49 BST 2005 | remoteReads = 0 | localReads = 74746 | ENU = 0.9941268 | replications = 74307 | ceUsage = 56.221153 | totalJobTime = 7.2084216E7 |

In this example, there was a CE usage of 56.22% for the whole grid.

Using the GUI

I am seeing the grid output using the GUI. In that, after I ran the simulator, in the summary table the percentage of CE
usage is showing zero. Using the GUI, how can I see or calculate the CE usage? I am not able to see the whole output in the command prompt, so please tell me how to do it with the GUI option.

I think that the easiest way for you to get the information, even if you are using the GUI, is to get it from the terminal output. OptorSim still outputs to the terminal even if you are using the GUI. If you redirect the output to a file, you will then be able to examine it at your leisure, e.g. bin\OptorSim.bat > myResults.txt should do it (although I am not so familiar with using the Windows command prompt). The GUI should still open up as usual, and at the end you can read the output as well as using any results you have saved from the GUI. Otherwise, if you really wanted a CE usage tab to appear in the GUI for the 'Grid' node, you would have to do some modification to the GUI code, which probably isn't worth it.

Memory used by Statistics

When the simulation is running, the memory usage is constantly increasing. I suspect this is because of all the statistics OptorSim is collecting. Is it possible to turn all the statistics off? I know there is a parameter which specifies how detailed the statistics are, but this parameter just defines what the simulator prints and not what is being collected during the simulation. The problem arises when the simulated topology is quite big. In such instances the memory usage becomes really high.

Unfortunately, there isn't any option to switch off collecting statistics. However, it should be fairly simple to disable the code that stores the statistics you're not interested in. Simply commenting out the relevant parts in (for example) SimpleComputingElement.java should do the trick.
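Once the output has been redirected to a file, values such as ceUsage or totalJobTime can be pulled out with a few lines of code. The sketch below is not the developers' original script; it assumes that each statistics record sits on one line in the "| key = value |" form of the GridContainer excerpt quoted earlier in this section, which may not match the exact layout of the real Statistics tree. The class StatsLineParser and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, NOT part of OptorSim: reads "| key = value |" statistics lines. */
class StatsLineParser {

    /** Returns the numeric key/value pairs found on one line, in order of appearance. */
    static Map<String, Double> parse(String line) {
        Map<String, Double> values = new LinkedHashMap<>();
        for (String field : line.split("\\|")) {
            String[] keyValue = field.split("=", 2);
            if (keyValue.length != 2) {
                continue; // header text or empty field, no "key = value" here
            }
            try {
                values.put(keyValue[0].trim(), Double.parseDouble(keyValue[1].trim()));
            } catch (NumberFormatException e) {
                // Non-numeric value: ignored in this sketch.
            }
        }
        return values;
    }

    /** Mean of one statistic over all lines that contain it; NaN if none do. */
    static double mean(List<String> lines, String key) {
        double sum = 0.0;
        int count = 0;
        for (String line : lines) {
            Double value = parse(line).get(key);
            if (value != null) {
                sum += value;
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }
}
```

For example, calling mean(...) with the key "totalJobTime" on the per-site statistics lines gives a figure comparable to the 'mean job time' described above, provided the lines passed in are the per-site ones and not the GridContainer total.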
Remote reads per file

Each site in the grid should maintain the number of times a file is accessed from a remote site. (For example: job1 needs file1 and file2, and job2 needs file1. If job1 and job2 run at site1, then file1 is accessed twice from the remote site and file2 is accessed once from the remote site.) Is there any function implemented in OptorSim for that? If not, how can I implement it?

The getStatistics method for CEs returns the total number of remote reads and local reads by that CE, but it doesn't store the number per file. You could add some more instrumentation to getStatistics for the CE to store that information if you like.

Resource Monitoring

In OptorSim, the resource availability is determined, which means the available resources (CE, SE and network bandwidth) can be either known beforehand or be calculated when scheduling decisions are made. Is my understanding right?

Yes, that's right. The resource broker has all information about the load at each CE and the network.

Realistically, the resource availability should be fed to the broker, especially when resources are not dedicated and/or there are multiple brokers. Therefore, the quality and freshness of the information reported by the Grid monitoring function is very important.

True, we have ignored simulating any monitoring system and assume the resource broker has perfect information, which is easy of course in a simulation but not realistic in real life!

I intend to add a monitoring module into OptorSim to feed the scheduler/resource broker. The monitoring function is
responsible for monitoring local resources (CE, SE, etc.) and reporting the monitoring information to the consumer (the broker). Could you suggest whether it's feasible to extend OptorSim to have the monitoring function, and where I should start?

This sounds like a good idea, if your aim is to simulate the effects of monitoring efficiencies on the efficiency of running jobs. Maybe you want to implement a P2P agent running at each site which sends monitoring information, similar to the ones we have for the auction protocol, and a scheduling algorithm which uses information gathered from these agents. The OptorSim code should (I hope!) make this relatively easy to do.
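To make the idea of imperfect monitoring concrete, the sketch below shows a broker-side view that only trusts load reports younger than a maximum age, so scheduling decisions depend on how fresh the monitoring information is. Everything here is hypothetical: MonitoringBoard, its methods and the use of a plain number for simulated time are assumptions for illustration, not OptorSim classes or its P2P agent API.

```java
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical broker-side store of monitoring reports, NOT part of OptorSim.
 * Each site reports its queue length together with the (simulated) time of the report.
 */
class MonitoringBoard {
    private static final class Report {
        final int queuedJobs;
        final long timestamp;

        Report(int queuedJobs, long timestamp) {
            this.queuedJobs = queuedJobs;
            this.timestamp = timestamp;
        }
    }

    private final long maxAge;
    // TreeMap keeps sites in name order, so ties are broken deterministically.
    private final Map<String, Report> latest = new TreeMap<>();

    MonitoringBoard(long maxAge) {
        this.maxAge = maxAge;
    }

    /** Called by a site's monitoring agent; replaces any older report from that site. */
    void report(String site, int queuedJobs, long now) {
        latest.put(site, new Report(queuedJobs, now));
    }

    /** Site with the shortest reported queue among fresh reports, or null if all are stale. */
    String leastLoadedSite(long now) {
        String best = null;
        int bestQueue = Integer.MAX_VALUE;
        for (Map.Entry<String, Report> entry : latest.entrySet()) {
            Report r = entry.getValue();
            if (now - r.timestamp > maxAge) {
                continue; // report too old to be trusted
            }
            if (r.queuedJobs < bestQueue) {
                bestQueue = r.queuedJobs;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

Varying maxAge and the interval between reports would then be one way to study how monitoring freshness affects job scheduling.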
