Here you can find some of our group’s publications.


2022

  • [1] S. Tuli, S. S. Gill, M. Xu, P. Garraghan, R. Bahsoon, S. Dustdar, R. Sakellariou, O. Rana, G. Casale, and N. R. Jennings, “HUNTER: AI based holistic resource management for sustainable cloud,” Journal of Systems and Software, vol. 184, 2022. doi:10.1016/j.jss.2021.111124
    [BibTeX] [Abstract]

    The worldwide adoption of cloud data centers (CDCs) has given rise to the ubiquitous demand for hosting application services on the cloud. Further, contemporary data-intensive industries have seen a sharp upsurge in the resource requirements of modern applications. This has led to the provisioning of an increased number of cloud servers, giving rise to higher energy consumption and, consequently, sustainability concerns. Traditional heuristics and reinforcement learning based algorithms for energy-efficient cloud resource management address the scalability and adaptability related challenges to a limited extent. Existing work often fails to capture dependencies across thermal characteristics of hosts, resource consumption of tasks and the corresponding scheduling decisions. This leads to poor scalability and an increase in the compute resource requirements, particularly in environments with non-stationary resource demands. To address these limitations, we propose an artificial intelligence (AI) based holistic resource management technique for sustainable cloud computing called HUNTER. The proposed model formulates the goal of optimizing energy efficiency in data centers as a multi-objective scheduling problem, considering three important models: energy, thermal and cooling. HUNTER utilizes a Gated Graph Convolution Network as a surrogate model for approximating the Quality of Service (QoS) for a system state and generating optimal scheduling decisions. Experiments on simulated and physical cloud environments using the CloudSim toolkit and the COSCO framework show that HUNTER outperforms state-of-the-art baselines in terms of energy consumption, SLA violation, scheduling time, cost and temperature by up to 12, 35, 43, 54 and 3 percent respectively.

    @article{f6b92d1412d44143b84dc9436a9fb2ce,
    author = "Tuli, Shreshth and {Singh Gill}, Sukhpal and Xu, Minxian and Garraghan, Peter and Bahsoon, Rami and Dustdar, Scharam and Sakellariou, Rizos and Rana, Omer and Casale, Giuliano and Jennings, {Nicholas R.}",
    title = "HUNTER:: AI based holistic resource management for sustainable cloud",
    abstract = "The worldwide adoption of cloud data centers (CDCs) has given rise to the ubiquitous demand for hosting application services on the cloud. Further, contemporary data-intensive industries have seen a sharp upsurge in the resource requirements of modern applications. This has led to the provisioning of an increased number of cloud servers, giving rise to higher energy consumption and, consequently, sustainability concerns. Traditional heuristics and reinforcement learning based algorithms for energy-efficient cloud resource management address the scalability and adaptability related challenges to a limited extent. Existing work often fails to capture dependencies across thermal characteristics of hosts, resource consumption of tasks and the corresponding scheduling decisions. This leads to poor scalability and an increase in the compute resource requirements, particularly in environments with non-stationary resource demands. To address these limitations, we propose an artificial intelligence (AI) based holistic resource management technique for sustainable cloud computing called HUNTER. The proposed model formulates the goal of optimizing energy efficiency in data centers as a multi-objective scheduling problem, considering three important models: energy, thermal and cooling. HUNTER utilizes a Gated Graph Convolution Network as a surrogate model for approximating the Quality of Service (QoS) for a system state and generating optimal scheduling decisions. Experiments on simulated and physical cloud environments using the CloudSim toolkit and the COSCO framework show that HUNTER outperforms state-of-the-art baselines in terms of energy consumption, SLA violation, scheduling time, cost and temperature by up to 12, 35, 43, 54 and 3 percent respectively.",
    keywords = "Cloud computing, sustainable computing, resource scheduling, datacenters",
    note = "This is the author{\textquoteright}s version of a work that was accepted for publication in Journal of Systems and Software. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Journal of Systems and Software, 184, 2022 DOI: 10.1016/j.jss.2021.111124",
    year = "2022",
    month = "February",
    day = "28",
    doi = "10.1016/j.jss.2021.111124",
    language = "English",
    volume = "184",
    journal = "Journal of Systems and Software",
    issn = "0164-1212",
    publisher = "Elsevier Inc.",
    pdf = ""
    }
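
    To make the surrogate-driven scheduling idea in [1] concrete, the sketch below shows the general pattern of scoring every candidate placement with a cheap surrogate and greedily picking the best one. It is a minimal illustration only: the hand-written qos_surrogate heuristic, the Host/Task fields and all constants are assumptions for this example, standing in for HUNTER's trained Gated Graph Convolution Network and its full energy, thermal and cooling models.

    # Minimal sketch of surrogate-driven scheduling (illustrative only).
    # A real system would replace qos_surrogate() with a trained model;
    # in HUNTER that role is played by a Gated Graph Convolution Network
    # over the whole system state.
    from dataclasses import dataclass

    @dataclass
    class Host:
        name: str
        cpu_free: float        # fraction of CPU currently free (0..1)
        temperature: float     # current temperature in Celsius
        power_per_util: float  # marginal power draw per unit utilisation (W)

    @dataclass
    class Task:
        name: str
        cpu_demand: float      # fraction of a host's CPU the task needs

    def qos_surrogate(host: Host, task: Task) -> float:
        """Higher score = better expected QoS (hypothetical heuristic)."""
        if task.cpu_demand > host.cpu_free:
            return float("-inf")                       # infeasible placement
        energy_cost = task.cpu_demand * host.power_per_util
        thermal_penalty = max(0.0, host.temperature - 70.0) * 5.0
        packing_bonus = 1.0 - (host.cpu_free - task.cpu_demand)
        return packing_bonus * 10.0 - energy_cost * 0.1 - thermal_penalty

    def schedule(task: Task, hosts: list) -> Host:
        """Greedy one-step lookahead: place on the best-scoring host."""
        best = max(hosts, key=lambda h: qos_surrogate(h, task))
        best.cpu_free -= task.cpu_demand
        return best

    if __name__ == "__main__":
        hosts = [Host("h1", 0.6, 65.0, 120.0), Host("h2", 0.9, 78.0, 90.0)]
        print("placed on", schedule(Task("job-42", 0.3), hosts).name)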


  • [2] G. Yeung, D. Borowiec, R. Yang, A. Friday, R. H. R. Harper, and P. Garraghan, “Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems,” IEEE Transactions on Parallel and Distributed Systems, vol. 33, iss. 1, p. 88–100, 2022. doi:10.1109/TPDS.2021.3079202
    [BibTeX] [Abstract] [pdf][Download PDF]

    To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the same GPU has been shown to be effective, this can incur interference causing slowdown. In this paper we propose Horus: an interference-aware and prediction-based resource manager for DL systems. Horus proactively predicts GPU utilization of heterogeneous DL jobs extrapolated from the DL model’s computation graph features, removing the need for online profiling and isolated reserved GPUs. Through micro-benchmarks and job co-location combinations across heterogeneous GPU hardware, we identify GPU utilization as a general proxy metric to determine good placement decisions, in contrast to current approaches which reserve isolated GPUs to perform online profiling and directly measure GPU utilization for each unique submitted job. Our approach promotes high resource utilization and makespan reduction; via real-world experimentation and large-scale trace driven simulation, we demonstrate that Horus outperforms other DL resource managers by up to 61.5% for GPU resource utilization, 23.7–30.7% for makespan reduction and 68.3% in job wait time reduction.

    @article{73c3a50be6824a859b434871b9b583fa,
    author = "Yeung, Ging-Fung and Borowiec, Damian and Yang, Renyu and Friday, Adrian and Harper, R.H.R. and Garraghan, Peter",
    title = "Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems",
    abstract = "To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the same GPU has been shown to be effective, this can incur interference causing slowdown. In this paper we propose Horus: an interference-aware and prediction-based resource manager for DL systems. Horus proactively predicts GPU utilization of heterogeneous DL jobs extrapolated from the DL model{\textquoteright}s computation graph features, removing the need for online profiling and isolated reserved GPUs. Through micro-benchmarks and job co-location combinations across heterogeneous GPU hardware, we identify GPU utilization as a general proxy metric to determine good placement decisions, in contrast to current approaches which reserve isolated GPUs to perform online profiling and directly measure GPU utilization for each unique submitted job. Our approach promotes high resource utilization and makespan reduction; via real-world experimentation and large-scale trace driven simulation, we demonstrate that Horus outperforms other DL resource managers by up to 61.5\% for GPU resource utilization, 23.7–30.7\% for makespan reduction and 68.3\% in job wait time reduction.",
    keywords = "distributed computing, Deep Learning, interference, cloud computing, GPU Scheduling",
    year = "2022",
    month = "January",
    day = "31",
    doi = "10.1109/TPDS.2021.3079202",
    language = "English",
    volume = "33",
    pages = "88--100",
    journal = "IEEE Transactions on Parallel and Distributed Systems",
    issn = "1045-9219",
    publisher = "IEEE Computer Society",
    number = "1",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/325262447/TPDS\_Horus\_4\_.pdf"
    }
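
    The placement rule at the heart of [2], treating predicted GPU utilization as a proxy for interference, can be sketched in a few lines. The snippet below is a simplified stand-in: the PREDICTED_UTIL table replaces Horus's learned predictor (which extrapolates utilization from a model's computation graph), and the job names, GPU names and the 0.9 cap are illustrative assumptions.

    # Minimal sketch of utilisation-proxy placement (illustrative only).
    # A dictionary stands in for a learned per-job GPU utilisation predictor.
    PREDICTED_UTIL = {
        "resnet50-train": 0.55,
        "bert-finetune": 0.70,
        "mnist-train": 0.15,
    }

    def place_job(job: str, gpu_loads: dict, cap: float = 0.9):
        """Place `job` on the GPU whose load stays lowest after adding the
        job's predicted utilisation; skip GPUs that would exceed `cap`
        (a simple stand-in for 'likely interference')."""
        demand = PREDICTED_UTIL[job]
        candidates = {g: load + demand for g, load in gpu_loads.items()
                      if load + demand <= cap}
        if not candidates:
            return None                        # queue the job instead
        best = min(candidates, key=candidates.get)
        gpu_loads[best] = candidates[best]
        return best

    if __name__ == "__main__":
        gpus = {"gpu0": 0.30, "gpu1": 0.60}
        for job in ["resnet50-train", "mnist-train", "bert-finetune"]:
            print(job, "->", place_job(job, gpus), gpus)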


  • [3] S. S. Nabavi, S. S. Gill, M. Xu, M. Masdari, and P. Garraghan, “TRACTOR: Traffic‐aware and power‐efficient virtual machine placement in edge‐cloud data centers using artificial bee colony optimization,” International Journal of Communication Systems, vol. 35, iss. 1, 2022. doi:10.1002/dac.4747
    [BibTeX] [Abstract] [pdf][Download PDF]

    Technology providers heavily exploit the usage of edge‐cloud data centers (ECDCs) to meet user demand while the ECDCs are large energy consumers. Concerning the decrease of the energy expenditure of ECDCs, task placement is one of the most prominent solutions for effective allocation and consolidation of such tasks onto physical machine (PM). Such allocation must also consider additional optimizations beyond power and must include other objectives, including network‐traffic effectiveness. In this study, we present a multi‐objective virtual machine (VM) placement scheme (considering VMs as fog tasks) for ECDCs called TRACTOR, which utilizes an artificial bee colony optimization algorithm for power and network‐aware assignment of VMs onto PMs. The proposed scheme aims to minimize the network traffic of the interacting VMs and the power dissipation of the data center’s switches and PMs. To evaluate the proposed VM placement solution, the Virtual Layer 2 (VL2) and three‐tier network topologies are modeled and integrated into the CloudSim toolkit to justify the effectiveness of the proposed solution in mitigating the network traffic and power consumption of the ECDC. Results indicate that our proposed method is able to reduce power energy consumption by 3.5% while decreasing network traffic and power by 15% and 30%, respectively, without affecting other QoS parameters.

    @article{d2fe03e0e9ca459eb2358f7e8ba63175,
    author = "{Shahab Nabavi}, Sayyid and {Singh Gill}, Sukhpal and Xu, Minxian and Masdari, Mohammad and Garraghan, Peter",
    title = "TRACTOR: Traffic‐aware and power‐efficient virtual machine placement in edge‐cloud data centers using artificial bee colony optimization",
    abstract = "Technology providers heavily exploit the usage of edge‐cloud data centers (ECDCs) to meet user demand while the ECDCs are large energy consumers. Concerning the decrease of the energy expenditure of ECDCs, task placement is one of the most prominent solutions for effective allocation and consolidation of such tasks onto physical machine (PM). Such allocation must also consider additional optimizations beyond power and must include other objectives, including network‐traffic effectiveness. In this study, we present a multi‐objective virtual machine (VM) placement scheme (considering VMs as fog tasks) for ECDCs called TRACTOR, which utilizes an artificial bee colony optimization algorithm for power and network‐aware assignment of VMs onto PMs. The proposed scheme aims to minimize the network traffic of the interacting VMs and the power dissipation of the data center's switches and PMs. To evaluate the proposed VM placement solution, the Virtual Layer 2 (VL2) and three‐tier network topologies are modeled and integrated into the CloudSim toolkit to justify the effectiveness of the proposed solution in mitigating the network traffic and power consumption of the ECDC. Results indicate that our proposed method is able to reduce power energy consumption by 3.5\% while decreasing network traffic and power by 15\% and 30\%, respectively, without affecting other QoS parameters.",
    keywords = "Cloud Computing, VM Placement, Artificial Bee Colony, Power Consumption, Network Traffic, Cloud Data Centers",
    note = "This is the peer reviewed version of the following article: TRACTOR: Traffic‐aware and power‐efficient virtual machine placement in edge‐cloud data centers using artificial bee colony optimization. doi: 10.1002/dac.4747 which has been published in final form at http://onlinelibrary.wiley.com/doi/10.1002/dac.4747/abstract This article may be used for non-commercial purposes in accordance With Wiley Terms and Conditions for self-archiving.",
    year = "2022",
    month = "January",
    day = "31",
    doi = "10.1002/dac.4747",
    language = "English",
    volume = "35",
    journal = "International Journal of Communication Systems",
    issn = "1074-5351",
    publisher = "John Wiley and Sons Ltd",
    number = "1",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/318758820/TRACTOR\_Final\_2\_Feb\_.pdf"
    }
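
    Entry [3] relies on artificial bee colony (ABC) optimization to search the space of VM-to-PM assignments. The sketch below shows the bare ABC loop (employed, onlooker and scout phases) applied to a toy placement objective; the cost model, colony size and every constant are assumptions for illustration, not TRACTOR's actual power and traffic formulation.

    # Minimal artificial-bee-colony search over VM-to-PM assignments
    # (illustrative only; the cost function is a toy stand-in for a
    # combined power + inter-PM traffic objective).
    import random

    N_VMS, N_PMS = 8, 3
    VM_CPU = [random.uniform(0.1, 0.4) for _ in range(N_VMS)]
    TRAFFIC = [[random.random() for _ in range(N_VMS)] for _ in range(N_VMS)]

    def cost(assign):
        """Toy objective: power of active PMs plus traffic crossing PMs."""
        load = [0.0] * N_PMS
        for vm, pm in enumerate(assign):
            load[pm] += VM_CPU[vm]
        power = sum(100 + 150 * l for l in load if l > 0)
        traffic = sum(TRAFFIC[i][j]
                      for i in range(N_VMS) for j in range(i + 1, N_VMS)
                      if assign[i] != assign[j])
        overload = sum(max(0.0, l - 1.0) for l in load) * 1e3  # soft capacity
        return power + 50 * traffic + overload

    def neighbour(assign, other):
        """Move one VM, biased towards the placement used by `other`."""
        new = assign[:]
        vm = random.randrange(N_VMS)
        new[vm] = other[vm] if random.random() < 0.5 else random.randrange(N_PMS)
        return new

    def abc(n_sources=10, limit=15, iters=200):
        sources = [[random.randrange(N_PMS) for _ in range(N_VMS)]
                   for _ in range(n_sources)]
        trials = [0] * n_sources
        for _ in range(iters):
            # Employed bees visit every source; onlookers pick sources
            # with probability proportional to fitness.
            fitness = [1.0 / (1.0 + cost(s)) for s in sources]
            onlookers = random.choices(range(n_sources), weights=fitness,
                                       k=n_sources)
            for i in list(range(n_sources)) + onlookers:
                partner = sources[random.randrange(n_sources)]
                cand = neighbour(sources[i], partner)
                if cost(cand) < cost(sources[i]):
                    sources[i], trials[i] = cand, 0
                else:
                    trials[i] += 1
            # Scout phase: abandon sources that stopped improving.
            for i in range(n_sources):
                if trials[i] > limit:
                    sources[i] = [random.randrange(N_PMS) for _ in range(N_VMS)]
                    trials[i] = 0
        return min(sources, key=cost)

    if __name__ == "__main__":
        best = abc()
        print("assignment:", best, "cost:", round(cost(best), 1))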


  • [4] D. Borowiec, R. H. R. Harper, and P. Garraghan, “Environmental Consequence of Deep Learning,” ITNOW, vol. 63, iss. 4, p. 10–11, 2022. doi:10.1093/itnow/bwab099
    [BibTeX] [Abstract]

    Deep learning and artificial intelligence are often viewed as panacea technologies — ones which can decarbonise many industries. But what is the carbon cost of these systems? Damian Borowiec, Richard R. Harper and Peter Garraghan discuss.

    @article{8c17187839184c2383fa639bc7ee65d3,
    author = "Borowiec, Damian and Harper, R.H.R. and Garraghan, Peter",
    title = "Environmental Consequence of Deep Learning",
    abstract = "Deep learning and artificial intelligence are often viewed as panacea technologies — ones which can decarbonise many industries. But what is the carbon cost of these systems? Damian Borowiec, Richard R. Harper and Peter Garraghan discuss.",
    keywords = "deep learning, energy, machine learning, sustainability, green computing",
    note = "This is a pre-copy-editing, author-produced PDF of an article accepted for publication in ITNow following peer review. The definitive publisher-authenticated versionDamian Borowiec, Richard R Harper, Peter Garraghan, The environmental consequence of deep learning, ITNOW, Volume 63, Issue 4, Winter 2021, Pages 10–11, https://doi.org/10.1093/itnow/bwab099 is available online at: https://academic.oup.com/itnow/article-abstract/63/4/10/6503628",
    year = "2022",
    month = "January",
    day = "11",
    doi = "10.1093/itnow/bwab099",
    language = "English",
    volume = "63",
    pages = "10--11",
    journal = "ITNOW",
    issn = "1746-5702",
    publisher = "Oxford University Press",
    number = "4",
    pdf = ""
    }


2021

  • [5] J. Gardiner, A. Eiffert, P. Garraghan, N. Race, S. Nagaraja, and A. Rashid, “Controller-in-the-Middle: Attacks on Software Defined Networks in Industrial Control Systems,” in CPSIoTSec ’21, 2021, p. 63–68. doi:10.1145/3462633.3483979
    [BibTeX] [Abstract] [pdf][Download PDF]

    Programmable networks are an area of increasing research activity and real-world usage. The most common example of programmable networks is software defined networking (SDN), in which the control and data planes are separated, with switches only acting as forwarding devices, controlled by software in the form of an SDN controller. As well as routing, this controller can perform other network functions such as load balancing and firewalls. There is an increasing amount of work proposing the use of SDN in industrial control systems (ICS) environments. The ability of SDN to dynamically control the network provides many potential benefits, including to security, utilising the dynamic orchestration of security controls. However, the centralisation of network control results in a single point of failure within the system, and thus potentially a major target of attack. An attacker who is capable of controlling the SDN controller gains near full control of the network. In this paper, we describe and analyse this very scenario. We demonstrate a number of simple, yet highly effective, attacks from a compromised SDN controller within an ICS environment which can break the real-time properties of industrial protocols, and potentially interfere with the operation of physical processes.

    @inproceedings{82080c4bf7784d70a3bee35022838e41,
    author = "Gardiner, Joe and Eiffert, Adam and Garraghan, Peter and Race, Nicholas and Nagaraja, Shishir and Rashid, Awais",
    title = "Controller-in-the-Middle: Attacks on Software Defined Networks in Industrial Control Systems",
    abstract = "Programmable networks are an area of increasing research activity and real-world usage. The most common example of programmable networks is software defined networking (SDN), in which the control and data planes are separated, with switches only acting as forwarding devices, controlled by software in the form of an SDN controller. As well as routing, this controller can perform other network functions such as load balancing and firewalls. There is an increasing amount of work proposing the use of SDN in industrial control systems (ICS) environments. The ability of SDN to dynamically control the network provides many potential benefits, including to security, utilising the dynamic orchestration of security controls. However, the centralisation of network control results in a single point of failure within the system, and thus potentially a major target of attack. An attacker who is capable of controlling the SDN controller gains near full control of the network. In this paper, we describe and analyse this very scenario. We demonstrate a number of simple, yet highly effective, attacks from a compromised SDN controller within an ICS environment which can break the real-time properties of industrial protocols, and potentially interfere with the operation of physical processes.",
    note = "{\textcopyright} ACM, 2021. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in CPSIoTSec '21 http://doi.acm.org/10.1145/3462633.3483979",
    year = "2021",
    month = "November",
    day = "30",
    doi = "10.1145/3462633.3483979",
    language = "English",
    series = "Joint Workshop on CPS \& IoT Security and Privacy (CPSIoTSec)",
    publisher = "ACM",
    pages = "63--68",
    booktitle = "CPSIoTSec '21",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/348539112/Impact\_of\_SDN\_in\_ICS\_author.pdf"
    }


  • [6] S. Tuli, S. S. Gill, P. Garraghan, R. Buyya, G. Casale, and N. R. Jennings, “START: Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder LSTM Networks,” IEEE Transactions on Services Computing, p. 1–1, 2021. doi:10.1109/TSC.2021.3129897
    [BibTeX] [Abstract] [pdf][Download PDF]

    Modern large-scale computing systems distribute jobs into multiple smaller tasks which execute in parallel to accelerate job completion rates and reduce energy consumption. However, a common performance problem in such systems is dealing with straggler tasks that are slow running instances that increase the overall response time. Such tasks can significantly impact the system’s Quality of Service (QoS) and the Service Level Agreements (SLA). To combat this issue, there is a need for automatic straggler detection and mitigation mechanisms that execute jobs without violating the SLA. Prior work typically builds reactive models that focus first on detection and then mitigation of straggler tasks, which leads to delays. Other works use prediction based proactive mechanisms, but ignore heterogeneous host or volatile task characteristics. In this paper, we propose a Straggler Prediction and Mitigation Technique (START) that is able to predict which tasks might be stragglers and dynamically adapt scheduling to achieve lower response times. Our technique analyzes all tasks and hosts based on compute and network resource consumption using an Encoder Long-Short-Term-Memory (LSTM) network. The output of this network is then used to predict and mitigate expected straggler tasks. This reduces the SLA violation rate and execution time without compromising QoS. Specifically, we use the CloudSim toolkit to simulate START in a cloud environment and compare it with state-of-the-art techniques (IGRU-SD, SGC, Dolly, GRASS, NearestFit and Wrangler) in terms of QoS parameters such as energy consumption, execution time, resource contention, CPU utilization and SLA violation rate. Experiments show that START reduces execution time, resource contention, energy and SLA violations by 13%, 11%, 16% and 19%, respectively, compared to the state-of-the-art approaches.

    @article{9ba5a85a7b764d0fb5e84fdfcc6237a5,
    author = "Tuli, Shreshth and {Singh Gill}, Sukhpal and Garraghan, Peter and Buyya, Rajkumar and Casale, Giuliano and Jennings, {Nicholas R.}",
    title = "START:: Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder LSTM Networks",
    abstract = "Modern large-scale computing systems distribute jobs into multiple smaller tasks which execute in parallel to accelerate job completion rates and reduce energy consumption. However, a common performance problem in such systems is dealing with straggler tasks that are slow running instances that increase the overall response time. Such tasks can significantly impact the system{\textquoteright}s Quality of Service (QoS) and the Service Level Agreements (SLA). To combat this issue, there is a need for automatic straggler detection andmitigation mechanisms that execute jobs without violating the SLA. Prior work typically builds reactive models that focus first on detection and then mitigation of straggler tasks, which leads to delays. Other works use prediction based proactive mechanisms, but ignore heterogeneous host or volatile task characteristics. In this paper, we propose a Straggler Prediction and Mitigation Technique (START) that is able to predict which tasks might be stragglers and dynamically adapt scheduling to achieve lower response times. Our technique analyzes all tasks and hosts based on compute and network resource consumption using an Encoder Long-Short-Term-Memory (LSTM) network. The output of this network is then used to predict and mitigate expected straggler tasks. This reduces the SLA violation rate and execution time without compromising QoS. Specifically, we use the CloudSim toolkit to simulate START in a cloud environment and compare it with state-of-the-art techniques (IGRU-SD, SGC, Dolly, GRASS, NearestFit and Wrangler) in terms of QoS parameters such as energy consumption, execution time, resource contention, CPU utilization and SLA violation rate. Experiments show that START reduces execution time, resource contention, energy and SLA violations by 13\%, 11\%, 16\% and 19\%, respectively, compared to the state-of-the-art approaches.",
    keywords = "Straggler, Deep Learning, Cloud computing, Prediction",
    note = "{\textcopyright}2021 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.",
    year = "2021",
    month = "November",
    day = "23",
    doi = "10.1109/TSC.2021.3129897",
    language = "English",
    pages = "1--1",
    journal = "IEEE Transactions on Services Computing",
    issn = "1939-1374",
    publisher = "Institute of Electrical and Electronics Engineers",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/341914911/TSC\_START\_accepted\_.pdf"
    }
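
    Entry [6] predicts stragglers from resource-usage traces with an encoder LSTM. The PyTorch sketch below shows only the data flow of such a predictor: the feature layout, network size, 0.5 threshold and the speculative re-launch rule are assumptions for illustration, and the model is untrained (random weights), so this is not START's actual architecture or training procedure.

    # Minimal sketch of LSTM-based straggler prediction (illustrative only).
    import torch
    import torch.nn as nn

    class StragglerPredictor(nn.Module):
        def __init__(self, n_features=4, hidden=32):
            super().__init__()
            self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, x):                 # x: (batch, time, features)
            _, (h_n, _) = self.encoder(x)     # final hidden state as encoding
            return torch.sigmoid(self.head(h_n[-1])).squeeze(-1)

    if __name__ == "__main__":
        model = StragglerPredictor()
        # 16 running tasks, 20 time steps of 4 metrics (e.g. CPU, mem, disk, net)
        traces = torch.rand(16, 20, 4)
        p_straggler = model(traces)
        # Simple mitigation rule: speculatively re-launch tasks above a threshold.
        to_relaunch = (p_straggler > 0.5).nonzero(as_tuple=True)[0].tolist()
        print("candidate stragglers:", to_relaunch)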


  • [7] D. Lindsay, G. Yeung, Y. Elkhatib, and P. Garraghan, “An Empirical Study of Inter-cluster Resource Orchestration within Federated Cloud Clusters,” in 2021 IEEE International Conference on Joint Cloud Computing (JCC), 2021. doi:10.1109/JCC53141.2021.00019
    [BibTeX] [Abstract] [pdf][Download PDF]

    Federated clusters are composed of multiple independent clusters of machines interconnected by a resource management system, and possess several advantages over centralized cloud datacenter clusters including seamless provisioning of applications across large geographic regions, greater fault tolerance, and increased cluster resource utilization. However, while existing resource management systems for federated clusters are capable of improving application intra-cluster performance, they do not capture inter-cluster performance in their decision making. This is important given federated clusters must execute a wide variety of applications possessing heterogeneous system architectures, which are impacted by unique inter-cluster performance conditions such as network latency and localized cluster resource contention. In this work we present an empirical study demonstrating how inter-cluster performance conditions negatively impact federated cluster orchestration systems. We conduct a series of micro-benchmarks under various cluster operational scenarios showing the critical importance in capturing inter-cluster performance for resource orchestration in federated clusters. From this benchmark, we determine precise limitations in existing federated orchestration, and highlight key insights to design future orchestration systems. Findings of notable interest entail different application types exhibiting innate performance affinities across various federated cluster operational conditions, and experience substantial performance degradation from even minor increases to latency (8.7x) and resource contention (12.0x) in comparison to centralized cluster architectures.

    @inproceedings{f5052b92c4874c45904bf54076cee739,
    author = "Lindsay, Dominic and Yeung, Ging-Fung and Elkhatib, Yehia and Garraghan, Peter",
    title = "An Empirical Study of Inter-cluster Resource Orchestration within Federated Cloud Clusters",
    abstract = "Federated clusters are composed of multiple independent clusters of machines interconnected by a resource management system, and possess several advantages over centralized cloud datacenter clusters including seamless provisioning of applications across large geographic regions, greater fault tolerance, and increased cluster resource utilization. However, while existing resource management systems for federated clusters are capable of improving application intra-cluster performance, they do not capture inter-cluster performance in their decision making. This is important given federated clusters must execute a wide variety of applications possessing heterogeneous system architectures, which are a impacted by unique inter-cluster performance conditions such as network latency and localized cluster resource contention. In this work we present an empirical study demonstrating how inter-cluster performance conditions negatively impact federated cluster orchestration systems. We conduct a series of micro-benchmarks under various cluster operational scenarios showing the critical importance in capturing inter-cluster performance for resource orchestration in federated clusters. From this benchmark, we determine precise limitations in existing federated orchestration, and highlight key insights to design future orchestration systems. Findings of notable interest entail different application types exhibiting innate performance affinities across various federated cluster operational conditions, and experience substantial performance degradation from even minor increases to latency (8.7x) and resource contention (12.0x) in comparison to centralized cluster architectures.",
    note = "{\textcopyright}2021 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.",
    year = "2021",
    month = "October",
    day = "13",
    doi = "10.1109/JCC53141.2021.00019",
    language = "English",
    isbn = "9781665434805",
    booktitle = "2021 IEEE International Conference on Joint Cloud Computing (JCC)",
    publisher = "IEEE",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/333221136/An\_Empirical\_Study\_of\_Inter\_cluster\_Resource\_Orchestration\_within\_Federated\_Cloud\_Clusters.pdf"
    }


  • [8] D. Lindsay, S. S. Gill, D. Smirnova, and P. Garraghan, “The evolution of distributed computing systems: from fundamental to new frontiers,” Computing, vol. 103, iss. 8, p. 1859–1878, 2021. doi:10.1007/s00607-020-00900-y
    [BibTeX] [Abstract] [pdf][Download PDF]

    Distributed systems have been an active field of research for over 60 years, and has played a crucial role in computer science, enabling the invention of the Internet that underpins all facets of modern life. Through technological advancements and their changing role in society, distributed systems have undergone a perpetual evolution, with each change resulting in the formation of a new paradigm. Each new distributed system paradigm—of which modern prominence include cloud computing, Fog computing, and the Internet of Things (IoT)—allows for new forms of commercial and artistic value, yet also ushers in new research challenges that must be addressed in order to realize and enhance their operation. However, it is necessary to precisely identify what factors drive the formation and growth of a paradigm, and how unique are the research challenges within modern distributed systems in comparison to prior generations of systems. The objective of this work is to study and evaluate the key factors that have influenced and driven the evolution of distributed system paradigms, from early mainframes, inception of the global inter-network, and to present contemporary systems such as edge computing, Fog computing and IoT. Our analysis highlights assumptions that have driven distributed systems appear to be changing, including (1) an accelerated fragmentation of paradigms driven by commercial interests and physical limitations imposed by the end of Moore’s law, (2) a transition away from generalized architectures and frameworks towards increasing specialization, and (3) each paradigm architecture results in some form of pivoting between centralization and decentralization coordination. Finally, we discuss present day and future challenges of distributed research pertaining to studying complex phenomena at scale and the role of distributed systems research in the context of climate change.

    @article{fa06e7cd42a84d3d8fbf824e9cefb8d3,
    author = "Lindsay, Dominic and {Singh Gill}, Sukhpal and Smirnova, Daria and Garraghan, Peter",
    title = "The evolution of distributed computing systems: from fundamental to new frontiers",
    abstract = "Distributed systems have been an active field of research for over 60 years, and has played a crucial role in computer science, enabling the invention of the Internet that underpins all facets of modern life. Through technological advancements and their changing role in society, distributed systems have undergone a perpetual evolution, with each change resulting in the formation of a new paradigm. Each new distributed system paradigm—of which modern prominence include cloud computing, Fog computing, and the Internet of Things (IoT)—allows for new forms of commercial and artistic value, yet also ushers in new research challenges that must be addressed in order to realize and enhance their operation. However, it is necessary to precisely identify what factors drive the formation and growth of a paradigm, and how unique are the research challenges within modern distributed systems in comparison to prior generations of systems. The objective of this work is to study and evaluate the key factors that have influenced and driven the evolution of distributed system paradigms, from early mainframes, inception of the global inter-network, and to present contemporary systems such as edge computing, Fog computing and IoT. Our analysis highlights assumptions that have driven distributed systems appear to be changing, including (1) an accelerated fragmentation of paradigms driven by commercial interests and physical limitations imposed by the end of Moore{\textquoteright}s law, (2) a transition away from generalized architectures and frameworks towards increasing specialization, and (3) each paradigm architecture results in some form of pivoting between centralization and decentralization coordination. Finally, we discuss present day and future challenges of distributed research pertaining to studying complex phenomena at scale and the role of distributed systems research in the context of climate change.",
    keywords = "Distributed Computing, Computing Systems, Evolution, Green Computing",
    note = "The final publication is available at Springer via http://dx.doi.org/10.1007/s00607-020-00900-y",
    year = "2021",
    month = "August",
    day = "31",
    doi = "10.1007/s00607-020-00900-y",
    language = "English",
    volume = "103",
    pages = "1859--1878",
    journal = "Computing",
    issn = "0010-485X",
    publisher = "Springer Wien",
    number = "8",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/318922691/COMP\_D\_20\_00070\_R2\_Camera\_Ready\_.pdf"
    }


2020

  • [9] Y. Yu, V. Jindal, I. Yen, F. Bastani, J. Xu, and P. Garraghan, “Integrating Clustering and Regression for Workload Estimation in the Cloud,” Concurrency and Computation Practice and Experience, vol. 32, iss. 23, 2020. doi:10.1002/cpe.5931
    [BibTeX] [Abstract] [pdf][Download PDF]

    Workload prediction has been widely researched in the literature. However, existing techniques are per‐job based and useful for service‐like tasks whose workloads exhibit seasonality and trend. But cloud jobs have many different workload patterns and some do not exhibit recurring workload patterns. We consider job‐pool‐based workload estimation, which analyzes the characteristics of existing tasks’ workloads to estimate the currently running tasks’ workload. First cluster existing tasks based on their workloads. For a new task J, collect the initial workload of J and determine which cluster J may belong to, then use the cluster’s characteristics to estimate J′s workload. Based on the Google dataset, the algorithm is experimentally evaluated and its effectiveness is confirmed. However, the workload patterns of some tasks do have seasonality and trend, and conventional per‐job‐based regression methods may yield better workload prediction results. Also, in some cases, some new tasks may not follow the workload patterns of existing tasks in the pool. Thus, develop an integrated scheme which combines clustering and regression and utilize the best of them for workload prediction. Experimental study shows that the combined approach can further improve the accuracy of workload prediction.

    @article{e60410b9c3444b0aabc341d67a98bd73,
    author = "Yu, Yongjia and Jindal, Vasu and Yen, I-Ling and Bastani, Farokh and Xu, Jie and Garraghan, Peter",
    title = "Integrating Clustering and Regression for Workload Estimation in the Cloud",
    abstract = "Workload prediction has been widely researched in the literature. However, existing techniques are per‐job based and useful for service‐like tasks whose workloads exhibit seasonality and trend. But cloud jobs have many different workload patterns and some do not exhibit recurring workload patterns. We consider job‐pool‐based workload estimation, which analyzes the characteristics of existing tasks' workloads to estimate the currently running tasks' workload. First cluster existing tasks based on their workloads. For a new task J, collect the initial workload of J and determine which cluster J may belong to, then use the cluster's characteristics to estimate J′s workload. Based on the Google dataset, the algorithm is experimentally evaluated and its effectiveness is confirmed. However, the workload patterns of some tasks do have seasonality and trend, and conventional per‐job‐based regression methods may yield better workload prediction results. Also, in some cases, some new tasks may not follow the workload patterns of existing tasks in the pool. Thus, develop an integrated scheme which combines clustering and regression and utilize the best of them for workload prediction. Experimental study shows that the combined approach can further improve the accuracy of workload prediction.",
    note = "This is the peer reviewed version of the following article: Yu, Y, Jindal, V, Yen, I‐L, Bastani, F, Xu, J, Garraghan, P. Integrating clustering and regression for workload estimation in the cloud. Concurrency Computat Pract Exper. 2020; e5931. https://doi.org/10.1002/cpe.5931 which has been published in final form at https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.5931 This article may be used for non-commercial purposes in accordance With Wiley Terms and Conditions for self-archiving.",
    year = "2020",
    month = "December",
    day = "10",
    doi = "10.1002/cpe.5931",
    language = "English",
    volume = "32",
    journal = "Concurrency and Computation Practice and Experience",
    issn = "1532-0626",
    publisher = "John Wiley and Sons Ltd",
    number = "23",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/298427375/Cloud\_Workload\_Prediction.pdf"
    }
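
    The combined scheme in [9], pooling historical workload curves into clusters and falling back to per-job regression when a task shows its own trend, can be sketched as below. The synthetic data, cluster count, observation-prefix length and the trend-based switching rule are all assumptions for illustration rather than the paper's exact algorithm.

    # Minimal sketch of pool-based clustering + per-job regression
    # for workload estimation (illustrative only).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)

    # Historical tasks: 200 synthetic workload curves of 24 intervals each.
    history = np.abs(rng.normal(1.0, 0.3, (200, 24))
                     + np.sin(np.linspace(0, 2 * np.pi, 24)))

    PREFIX = 6  # intervals observed for a new task before estimating
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(history[:, :PREFIX])
    centroids_full = np.array([history[kmeans.labels_ == c].mean(axis=0)
                               for c in range(5)])

    def estimate(prefix: np.ndarray) -> np.ndarray:
        """Use the matching cluster's mean future curve, unless the prefix
        shows a strong trend, in which case fall back to a per-job linear
        extrapolation (a simple stand-in for the regression branch)."""
        t = np.arange(PREFIX).reshape(-1, 1)
        reg = LinearRegression().fit(t, prefix)
        if abs(reg.coef_[0]) > 0.1:                    # clear trend observed
            future_t = np.arange(PREFIX, 24).reshape(-1, 1)
            return reg.predict(future_t)
        cluster = kmeans.predict(prefix.reshape(1, -1))[0]
        return centroids_full[cluster, PREFIX:]

    if __name__ == "__main__":
        new_task_prefix = history[0, :PREFIX] + rng.normal(0, 0.05, PREFIX)
        print(np.round(estimate(new_task_prefix), 2))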


  • [10] S. S. Gill, X. Ouyang, and P. Garraghan, “Tails in the cloud: a survey and taxonomy of straggler management within large‑scale cloud data centres,” Journal of Supercomputing, vol. 76, p. 10050–10089, 2020. doi:10.1007/s11227-020-03241-x
    [BibTeX] [Abstract] [pdf][Download PDF]

    Cloud computing systems are splitting compute- and data-intensive jobs into smaller tasks to execute them in a parallel manner using clusters to improve execution time. However, such systems at increasing scale are exposed to stragglers, whereby abnormally slow running tasks executing within a job substantially affect job performance completion. Such stragglers are a direct threat towards attaining fast execution of data-intensive jobs within cloud computing. Researchers have proposed an assortment of different mechanisms, frameworks, and management techniques to detect and mitigate stragglers both proactively and reactively. In this paper, we present a comprehensive review of straggler management techniques within large-scale cloud data centres. We provide a detailed taxonomy of straggler causes, as well as proposed management and mitigation techniques based on straggler characteristics and properties. From this systematic review, we outline several outstanding challenges and potential directions of possible future work for straggler research.

    @article{d7b52525af014ec4a08bd88a8e656765,
    author = "{Singh Gill}, Sukhpal and Ouyang, Xue and Garraghan, Peter",
    title = "Tails in the cloud: a survey and taxonomy of straggler management within large‑scale cloud data centres",
    abstract = "Cloud computing systems are splitting compute- and data-intensive jobs into smaller tasks to execute them in a parallel manner using clusters to improve execution time. However, such systems at increasing scale are exposed to stragglers, whereby abnormally slow running tasks executing within a job substantially affect job performance completion. Such stragglers are a direct threat towards attaining fast execution of data-intensive jobs within cloud computing. Researchers have proposed an assortment of different mechanisms, frameworks, and management techniques to detect and mitigate stragglers both proactively and reactively. In this paper, we present a comprehensive review of straggler management techniques within large-scale cloud data centres. We provide a detailed taxonomy of straggler causes, as well as proposed management and mitigation techniques based on straggler characteristics and properties. From this systematic review, we outline several outstanding challenges and potential directions of possible future work for straggler research.",
    keywords = "Computing, Stragglers, Cloud computing, Straggler management, Distributed systems, Cloud data centres",
    note = "The final publication is available at Springer via http://dx.doi.org/10.1007/s11227-020-03241-x",
    year = "2020",
    month = "December",
    day = "1",
    doi = "10.1007/s11227-020-03241-x",
    language = "English",
    volume = "76",
    pages = "10050–10089",
    journal = "Journal of Supercomputing",
    issn = "0920-8542",
    publisher = "Springer Netherlands",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/292772933/SUPE\_D\_20\_00042.R1.pdf"
    }


  • [11] P. Terenius, P. Garraghan, and R. H. R. Harper, “Using data centre waste heat to dry coffee whilst supplying small-scale farmers with ICT: a case study based on a novel systems-based approach,” presented at the International Conference on Sustainable Development (ICSD), 2020, p. 1–17.
    [BibTeX] [Abstract]

    In light of the current climate crisis, a holistic approach to infrastructural matters regarding energy, communication, data and sustainable communities, as well as the water-food-energy nexus in general, is critical. One enabler for building sustainable communities around the Globe is ICT (information and communications technology). In the near future, the number of ICT systems will expand significantly in warm parts of the world, because of larger populations and increased relative wealth. As the backbone of ICT, data centres and mobile networks consume up to a few per cent of the world’s electrical energy, energy ending up as waste heat. In cold areas, the waste heat is sometimes reused to heat buildings. However, hitherto excessive heat has not been given much thought in regards to warm countries. In our research, we address waste heat from these systems, to reuse perhaps one or two per cent of the world’s future electrical energy. The relatively low outgoing temperature of a data centre’s airflow makes turning heat to electricity a non-viable option, as energy conversion losses would be massive. Hence, we focus on secondary uses for hot air. Based on a systems science approach, one of the themes we currently explore involves coffee drying. Many low- and mid-income countries are producing coffee, which needs drying as part of its production process. In some regions, coffee beans can be sun-dried, but other areas are too humid. In those cases, drying is commonly carried out using electricity-powered machinery. For a drying facility, the prospects of instead using waste heat to dry coffee are appealing. Conversely, if the presence of a drying facility in a community may be powered by waste heat, this may call for small-scale data centre construction, in turn increasing ICT availability locally or regionally. In other words, there is a bond between environmental gains and sustainable growth of a community. We are therefore investigating not only environmental but also societal benefits of this idea. For example, our approach gives more power to local producers of sustainable coffee: drying coffee beans close to source and then, through ICT, take a more active part in the supply chain may massively increase the profit for local farmers or collective efforts. Through a site selection based on a newly developed index, we have chosen Costa Rica for our case study, and arrived to an estimate for data centre waste heat drying capability in that country. We also discuss our findings in relation to the UN Sustainable Development Goals (SDGs). Due to the complexity of this project, it is too early to say to what extent data centre waste heat can indeed be used in these specific circumstances. Still, as coffee drying is achieved in different manners depending on topography, humidity, social structures, legislation and tradition, the innovative approach may have merit in some low- and mid-income country contexts.

    @conference{b44717efb10c47a1a5a4734169b1fd53,
    author = "Terenius, Petter and Garraghan, Peter and Harper, R.H.R.",
    title = "Using data centre waste heat to dry coffee whilst supplying small-scale farmers with ICT: a case study based on a novel systems-based approach",
    abstract = "In light of the current climate crisis, a holistic approach to infrastructural matters regarding energy, communication, data and sustainable communities, as well as the water-food-energy nexus in general, is critical. One enabler for building sustainable communities around the Globe is ICT (information and communications technology). In the near future, the number of ICT systems will expand significantly in warm parts of the world, because of larger populations and increased relative wealth.As the backbone of ICT, data centres and mobile networks consume up to a few per cent of the world{\textquoteright}s electrical energy, energy ending up as waste heat. In cold areas, the waste heat is sometimes reused to heat buildings. However, hitherto excessive heat has not been given much thought in regards to warm countries. In our research, we address waste heat from these systems, to reuse perhaps one or two per cent of the world{\textquoteright}s future electrical energy. The relatively low outgoing temperature of a data centre{\textquoteright}s airflow makes turning heat to electricity a non-viable option, as energy conversion losses would be massive. Hence, we focus on secondary uses for hot air.Based on a systems science approach, one of the themes we currently explore involves coffee drying. Many low- and mid-income countries are producing coffee, which needs drying as part of its production process. In some regions, coffee beans can be sun-dried, but other areas are too humid. In those cases, drying is commonly carried out using electricity-powered machinery. For a drying facility, the prospects of instead using waste heat to dry coffee are appealing. Conversely, if the presence of a drying facility in a community may be powered by waste heat, this may call for small-scale data centre construction, in turn increasing ICT availability locally or regionally.In other words, there is a bond between environmental gains and sustainable growth of a community. We are therefore investigating not only environmental but also societal benefits of this idea. For example, our approach gives more power to local producers of sustainable coffee: drying coffee beans close to source and then, through ICT, take a more active part in the supply chain may massively increase the profit for local farmers or collective efforts.Through a site selection based on a newly developed index, we have chosen Costa Rica for our case study, and arrived to an estimate for data centre waste heat drying capability in that country. We also discuss our findings in relation to the UN Sustainable Development Goals (SDGs).Due to the complexity of this project, it is too early to say to what extent data centre waste heat can indeed be used in these specific circumstances. Still, as coffee drying is achieved in different manners depending on topography, humidity, social structures, legislation and tradition, the innovative approach may have merit in some low- and mid-income country contexts.",
    keywords = "sustainability, systems science, Data centres, coffee, energy, Costa Rica",
    year = "2020",
    month = "November",
    day = "21",
    language = "English",
    pages = "1--17",
    note = "2020 International Conference on Sustainable Development , ICSD ; Conference date: 21-09-2020 Through 22-09-2020",
    pdf = ""
    }


  • [12] A. Saeed, P. Garraghan, and S. A. Hussain, “Cross-VM Network Channel Attacks and Countermeasures within Cloud Computing Environments,” IEEE Transactions on Dependable and Secure Computing, 2020. doi:10.1109/TDSC.2020.3037022
    [BibTeX] [Abstract] [pdf][Download PDF]

    Cloud providers attempt to maintain the highest levels of isolation between Virtual Machines (VMs) and inter-user processes to keep co-located VMs and processes separate. This logical isolation creates an internal virtual network to separate VMs co-residing within a shared physical network. However, as co-residing VMs share their underlying VMM (Virtual Machine Monitor), virtual network, and hardware, they are susceptible to cross VM attacks. It is possible for a malicious VM to potentially access or control other VMs through network connections, shared memory, other shared resources, or by gaining the privilege level of its non-root machine. This research presents two novel zero-day cross-VM network channel attacks. In the first attack, a malicious VM can redirect the network traffic of target VMs to a specific destination by impersonating the Virtual Network Interface Controller (VNIC). The malicious VM can extract the decrypted information from target VMs by using open source decryption tools such as Aircrack. The second contribution of this research is a privilege escalation attack in a cross VM cloud environment with Xen hypervisor. An adversary having limited privileges rights may execute Return-Oriented Programming (ROP), establish a connection with the root domain by exploiting the network channel, and acquire the tool stack (root domain) which it is not authorized to access directly. Countermeasures against these attacks are also presented.

    @article{f694584cc3eb44faa6567f6f7d405fdc,
    author = "Saeed, Atif and Garraghan, Peter and {Asad Hussain}, Syed",
    title = "Cross-VM Network Channel Attacks and Countermeasures within Cloud Computing Environments",
    abstract = "Cloud providers attempt to maintain the highest levels of isolation between Virtual Machines (VMs) and inter-user processes to keep co-located VMs and processes separate. This logical isolation creates an internal virtual network to separate VMs co-residing within a shared physical network. However, as co-residing VMs share their underlying VMM (Virtual Machine Monitor), virtual network, and hardware are susceptible to cross VM attacks. It is possible for a malicious VM to potentially access or control other VMs through network connections, shared memory, other shared resources, or by gaining the privilege level of its non-root machine. This research presents a two novel zero-day cross-VM network channel attacks. In the first attack, a malicious VM can redirect the network traffic of target VMs to a specific destination by impersonating the Virtual Network Interface Controller (VNIC). The malicious VM can extract the decrypted information from target VMs by using open source decryption tools such as Aircrack. The second contribution of this research is a privilege escalation attack in a cross VM cloud environment with Xen hypervisor. An adversary having limited privileges rights may execute Return-Oriented Programming (ROP), establish a connection with the root domain by exploiting the network channel, and acquiring the tool stack (root domain) which it is not authorized to access directly. Countermeasures against this attacks are also presented",
    keywords = "Cloud computing, Security, Cyber-security, Cloud security",
    note = "{\textcopyright}2020 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.",
    year = "2020",
    month = "November",
    day = "10",
    doi = "10.1109/TDSC.2020.3037022",
    language = "English",
    journal = "IEEE Transactions on Dependable and Secure Computing",
    issn = "1545-5971",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/309546598/Cross\_VM\_cloud\_attacks\_Final\_1\_.pdf"
    }


  • [13] G. Yeung, D. Borowiec, R. Yang, A. Friday, R. H. R. Harper, and P. Garraghan, “Horus: An Interference-aware Resource Manager for Deep Learning Systems,” in Algorithms and Architectures for Parallel Processing. ICA3PP 2020, 2020, p. 492–508. doi:10.1007/978-3-030-60239-0_33
    [BibTeX] [Abstract] [pdf][Download PDF]

    Deep Learning (DL) models are deployed as jobs within machines containing GPUs. These DL systems – ranging from a singular GPU device to machine clusters – require state-of-the-art resource management to increase resource utilization and job throughput. While it has been identified that co-location – multiple jobs co-located within the same GPU – is an effective means to achieve this, such co-location incurs performance interference that directly debilitates DL training and inference performance. Existing approaches to mitigate interference require resource intensive and time consuming kernel profiling ill-suited for runtime scheduling decisions. Current DL system resource management are not designed to deal with these problems. This paper proposes Horus, an interference-aware resource manager for DL systems. Instead of leveraging expensive kernel-profiling, our approach estimates job resource utilization and co-location patterns to determine effective DL job placement to minimize likelihood of interference, as well as improve system resource utilization and makespan. Our analysis shows that interference cause up to 3.2x DL job slowdown. We integrated our approach within the Kubernetes resource manager, and conduct experiments in a DL cluster by training 2,500 DL jobs using 13 different models types. Results demonstrate that Horus is able to outperform other DL resource managers by up to 61.5% for resource utilization and 33.6% for makespan.

    @inproceedings{424f183a402a49379d52cda3496cfa8b,
    author = "Yeung, Gingfung and Borowiec, Damian and Yang, Renyu and Friday, Adrian and Harper, R.H.R. and Garraghan, Peter",
    editor = "Qiu, M.",
    title = "Horus: An Interference-aware Resource Manager for Deep Learning Systems",
    abstract = "Deep Learning (DL) models are deployed as jobs within machines containing GPUs. These DL systems - ranging from a singular GPU device to machine clusters - require state-of-the-art resource management to increase resource utilization and job throughput. While it has been identified that co-location - multiple jobs co-located within the same GPU - is an effective means to achieve this, such co-location incurs performance interference that directly debilitates DL training and inference performance. Existing approaches to mitigate interference require resource intensive and time consuming kernel profiling ill-suited for runtime scheduling decisions. Current DL system resource management are not designed to deal with these problems. This paper proposes Horus, an interference-aware resource manager for DL systems. Instead of leveraging expensive kernel-profiling, our approach estimates job resource utilization and co-location patterns to determine effective DL job placement to minimize likelihood of interference, as well as improve system resource utilization and makespan. Our analysis shows that interference cause up to 3.2x DL job slowdown. We integrated our approach within the Kubernetes resource manager, and conduct experiments in a DL cluster by training 2,500 DL jobs using 13 different models types. Results demonstrate that Horus is able to outperform other DL resource managers by up to 61.5\% for resource utilization and 33.6\% for makespan.",
    keywords = "Machine Learning Systems, Performance Interference, Deep Learning, GPU Scheduling, Cluster resource management",
    year = "2020",
    month = "September",
    day = "29",
    doi = "10.1007/978-3-030-60239-0\_33",
    language = "English",
    isbn = "9783030602383",
    series = "Lecture Notes in Computer Science",
    publisher = "Springer",
    pages = "492--508",
    booktitle = "Algorithms and Architectures for Parallel Processing. ICA3PP 2020",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/301059432/ICA3PP\_Horus\_Yeung\_Accepted\_.pdf"
    }


  • [14] S. S. Gill, S. Tuli, A. N. Toosi, F. Cuadrado, P. Garraghan, R. Bahsoon, H. Lutfiyya, R. Sakellariou, O. Rana, S. Dustdar, and R. Buyya, “ThermoSim: Deep learning based framework for modeling and simulation of thermal-aware resource management for cloud computing environments,” Journal of Systems and Software, vol. 166, 2020. doi:10.1016/j.jss.2020.110596
    [BibTeX] [Abstract] [pdf][Download PDF]

    Current cloud computing frameworks host millions of physical servers that utilize cloud computing resources in the form of different virtual machines. Cloud Data Center (CDC) infrastructures require significant amounts of energy to deliver large scale computational services. Moreover, computing nodes generate large volumes of heat, in turn requiring cooling units to eliminate the effect of this heat. Thus, the overall energy consumption of the CDC increases tremendously for servers as well as for cooling units. However, current workload allocation policies do not take into account the effect on temperature, and it is challenging to simulate the thermal behavior of CDCs. There is a need for a thermal-aware framework to simulate and model the behavior of nodes and measure the important performance parameters which can be affected by temperature. In this paper, we propose a lightweight framework, ThermoSim, for modeling and simulation of thermal-aware resource management for cloud computing environments. This work presents a Recurrent Neural Network based deep learning temperature predictor for CDCs, which is utilized by ThermoSim for lightweight resource management in constrained cloud environments. ThermoSim extends the CloudSim toolkit, helping to analyze the performance of key parameters such as energy consumption, service level agreement violation rate, number of virtual machine migrations and temperature during the management of cloud resources for execution of workloads. Further, different energy-aware and thermal-aware resource management techniques are tested using the proposed ThermoSim framework in order to validate it against the existing framework (Thas). The experimental results demonstrate that the proposed framework is capable of modeling and simulating the thermal behavior of a CDC, and that ThermoSim outperforms Thas in terms of energy consumption, cost, time, memory usage and prediction accuracy.

    @article{9ad35f32ea8a4322807504162803f789,
    author = "Gill, S.S. and Tuli, S. and Toosi, A.N. and Cuadrado, F. and Garraghan, P. and Bahsoon, R. and Lutfiyya, H. and Sakellariou, R. and Rana, O. and Dustdar, S. and Buyya, R.",
    title = "ThermoSim: Deep learning based framework for modeling and simulation of thermal-aware resource management for cloud computing environments",
    abstract = "Current cloud computing frameworks host millions of physical servers that utilize cloud computing resources in the form of different virtual machines. Cloud Data Center (CDC) infrastructures require significant amounts of energy to deliver large scale computational services. Moreover, computing nodes generate large volumes of heat, requiring cooling units in turn to eliminate the effect of this heat. Thus, overall energy consumption of the CDC increases tremendously for servers as well as for cooling units. However, current workload allocation policies do not take into account effect on temperature and it is challenging to simulate the thermal behavior of CDCs. There is a need for a thermal-aware framework to simulate and model the behavior of nodes and measure the important performance parameters which can be affected by its temperature. In this paper, we propose a lightweight framework, ThermoSim, for modeling and simulation of thermal-aware resource management for cloud computing environments. This work presents a Recurrent Neural Network based deep learning temperature predictor for CDCs which is utilized by ThermoSim for lightweight resource management in constrained cloud environments. ThermoSim extends the CloudSim toolkit helping to analyze the performance of various key parameters such as energy consumption, service level agreement violation rate, number of virtual machine migrations and temperature during the management of cloud resources for execution of workloads. Further, different energy-aware and thermal-aware resource management techniques are tested using the proposed ThermoSim framework in order to validate it against the existing framework (Thas). The experimental results demonstrate the proposed framework is capable of modeling and simulating the thermal behavior of a CDC and ThermoSim framework is better than Thas in terms of energy consumption, cost, time, memory usage and prediction accuracy.",
    keywords = "Cloud computing, Deep learning, Energy, Resource management, Simulation, Thermal-aware, Energy utilization, Environmental management, Natural resources management, Network security, Power management, Printing machinery, Recurrent neural networks, Resource allocation, Virtual machine, Cloud computing environments, Current cloud computing, Lightweight frameworks, Modeling and simulating, Performance parameters, Resource management techniques, Service Level Agreements, Virtual machine migrations, Green computing",
    note = "This is the author{\textquoteright}s version of a work that was accepted for publication in Journal of Systems and Software. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Journal of Systems and Software, 166, 2020 DOI: 10.1016/j.jss.2020.110596",
    year = "2020",
    month = "August",
    day = "1",
    doi = "10.1016/j.jss.2020.110596",
    language = "English",
    volume = "166",
    journal = "Journal of Systems and Software",
    issn = "0164-1212",
    publisher = "Elsevier Inc.",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/295977181/2004.08131.pdf"
    }


  • [15] J. Bulman and P. Garraghan, “A cloud gaming framework for dynamic graphical rendering towards achieving distributed game engines,” in The 12th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud ’20), 2020.
    [BibTeX] [Abstract] [pdf][Download PDF]

    Cloud gaming in recent years has gained growing success in delivering games-as-a-service by leveraging cloud resources. Existing cloud gaming frameworks deploy the entire game engine within Virtual Machines (VMs) due to the tight coupling of game engine subsystems (graphics, physics, AI). The effectiveness of such an approach is heavily dependent on the cloud VM providing consistently high levels of performance, availability, and reliability. However, this assumption is difficult to guarantee due to QoS degradation within, and outside of, the cloud – from system failure, network connectivity, to consumer data caps – all of which may result in game service outage. We present a cloud gaming framework that creates a distributed game engine by loose-coupling the graphical renderer from the game engine, allowing for its execution across cloud VMs and client devices dynamically. Our framework allows games to operate during performance degradation and cloud service failure, enabling game developers to exploit heterogeneous graphical APIs unrestricted from Operating System and hardware constraints. Our initial experiments show that our framework improves game frame rates by up to 33\% via frame interlacing between cloud and client systems.

    @inproceedings{6067f5e845e347a2b80f0c518fc7bb91,
    author = "Bulman, James and Garraghan, Peter",
    title = "A cloud gaming framework for dynamic graphical rendering towards achieving distributed game engines",
    abstract = "Cloud gaming in recent years has gained growing success in delivering games-as-a-service by leveraging cloud resources. Existing cloud gaming frameworks deploy the entire game engine within Virtual Machines (VMs) due to the tight-coupling of game engine subsystems (graphics, physics, AI). The effectiveness of such an approach is heavily dependant on the cloud VM providing consistently high levels of performance, availability, and reliability. However this assumption is difficult to guarantee due to QoS degradation within, and outside of, the cloud - from system failure, network connectivity, to consumer datacaps - all of which may result in game service outage. We present a cloud gaming framework that creates a distributed game engine via loose-coupling the graphical renderer from the game engine, allowing for its execution across cloud VMs and client devices dynamically. Our framework allows games to operate during performance degradation and cloud service failure, enabling game developers to exploit heterogeneous graphical APIs unrestricted from Operating System and hardware constraints. Our initial experiments show that our framework improves game frame rates by up to 33\% via frame interlacing between cloud and client systems.",
    keywords = "Cloud computing, Cloud gaming, Gaming technologies",
    year = "2020",
    month = "May",
    day = "1",
    language = "English",
    booktitle = "The 12th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '20)",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/297331374/HotCloud20\_Cloud\_Gaming.pdf"
    }


  • [16] G. Yeung, D. Borowiec, A. Friday, R. H. R. Harper, and P. Garraghan, “Towards GPU Utilization Prediction for Cloud Deep Learning,” in The 12th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud ’20), 2020.
    [BibTeX] [Abstract] [pdf][Download PDF]

    Understanding the GPU utilization of Deep Learning (DL) workloads is important for enhancing resource-efficiency and cost-benefit decision making for DL frameworks in the cloud. Current approaches to determine DL workload GPU utilization rely on online profiling within isolated GPU devices, and must be performed for every unique DL workload submission resulting in resource under-utilization and reduced service availability. In this paper, we propose a prediction engine to proactively determine the GPU utilization of heterogeneous DL workloads without the need for in-depth or isolated online profiling. We demonstrate that it is possible to predict DL workload GPU utilization via extracting information from its model computation graph. Our experiments show that the prediction engine achieves an RMSLE of 0.154, and can be exploited by DL schedulers to achieve up to 61.5\% improvement to GPU cluster utilization.

    @inproceedings{d42f70a84aca4404afea418569d286ab,
    author = "Yeung, Ging-Fung and Borowiec, Damian and Friday, Adrian and Harper, R.H.R. and Garraghan, Peter",
    title = "Towards GPU Utilization Prediction for Cloud Deep Learning",
    abstract = "Understanding the GPU utilization of Deep Learning (DL) workloads is important for enhancing resource-efficiency and cost-benefit decision making for DL frameworks in the cloud. Current approaches to determine DL workload GPU utilization rely on online profiling within isolated GPU devices, and must be performed for every unique DL workload submission resulting in resource under-utilization and reduced service availability. In this paper, we propose a prediction engine to proactively determine the GPU utilization of heterogeneous DL workloads without the need for in-depth or isolated online profiling. We demonstrate that it is possible to predict DL workload GPU utilization via extracting information from its model computation graph. Our experiments show that the prediction engine achieves an RMSLE of 0.154, and can be exploited by DL schedulers to achieve up to 61.5\% improvement to GPU cluster utilization.",
    year = "2020",
    month = "May",
    day = "1",
    language = "English",
    booktitle = "The 12th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '20)",
    publisher = "USENIX Association",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/297331283/HotCloud20\_GPU\_Prediction.pdf"
    }


  • [17] R. Yang, Z. Wen, D. McKee, T. Lin, J. Xu, and P. Garraghan, “Software-Defined Fog Orchestration for IoT Services,” in Fog and Fogonomics, Y. Yang, J. Huang, T. Zhang, and J. Weinman, Eds., John Wiley, 2020, p. 179–212.
    [BibTeX]
    @inbook{5f8632136efc40f8bebec4e69d47221f,
    author = "Yang, Renyu and Wen, Zhenyu and McKee, David and Lin, Tao and Xu, Jie and Garraghan, Peter",
    editor = "Yang, Yang and Huang, Jianwei and Zhang, Tao and Weinman, Joe",
    title = "Software-Defined Fog Orchestration for IoT Services",
    year = "2020",
    month = "March",
    day = "4",
    language = "English",
    isbn = "9781119501091",
    pages = "179--212",
    booktitle = "Fog and Fogonomics",
    publisher = "John Wiley",
    pdf = ""
    }


  • [18] R. Yang, Z. Wen, D. McKee, T. Lin, J. Xu, and P. Garraghan, “Fog Orchestration and Simulation for IoT Services,” in Fog and Fogonomics, Y. Yang, J. Huang, T. Zhang, and J. Weinman, Eds., Wiley, 2020, p. 179–212.
    [BibTeX] [Abstract] [pdf][Download PDF]

    The Internet of Things (IoT) interconnects physical objects including sensors, vehicles, and buildings into a virtual environment, resulting in the increasing integration of cyber-physical objects. The Fog computing paradigm extends both computation and storage services in the Cloud computing environment to the network edge. Typically, IoT services comprise a set of software components running over different locations connected through datacenter or wireless sensor networks. It is important and cost-effective to orchestrate and deploy a group of microservices onto Fog appliances such as edge devices or Cloud servers for the formation of such IoT services. In this chapter, we discuss the challenges of realizing Fog orchestration for IoT services, and present a software-defined orchestration architecture and simulation solutions to intelligently compose and orchestrate thousands of heterogeneous Fog appliances. The resource provisioning, component placement and runtime QoS control in the orchestration procedure can harness workload dynamicity, network uncertainty and security demands whilst considering different applications' requirements and appliances' capabilities. Our practical experiences show that the proposed parallelized orchestrator can reduce the execution time by 50\% with at least 30\% higher orchestration quality. We believe that our solution plays an important role in the current Fog ecosystem.

    @inbook{9c434213a47d45b486ebc9c416d132d0,
    author = "Yang, Renyu and Wen, Zhenyu and McKee, David and Lin, Tao and Xu, Jie and Garraghan, Peter",
    editor = "Yang, Yang and Huang, Jianwei and Zhang, Tao and Weinman, Joe",
    title = "Fog Orchestration and Simulation for IoT Services",
    abstract = "The Internet of Things (IoT) interconnects physical objects including sensors, vehicles, and buildings into a virtual circumstance, resulting in the increasing integration of Cyber-physical objects. The Fog computing paradigm extends both computation and storage services in Cloud computing environment to the network edge. Typically, IoT services comprise of a set of software components running over different locations connected through datacenter or wireless sensor networks. It is significantly important and cost-effective to orchestrate and deploy a group of microservices onto Fog appliances such as edge devices or Cloud servers for the formation of such IoT services. In this chapter, we discuss the challenges of realizing Fog orchestration for IoT services, and present a software-defined orchestration architecture and simulation solutions to intelligently compose and orchestrate thousands of heterogeneous Fog appliances. The resource provisioning, component placement and runtime QoS control in the orchestration procedure can harness workload dynamicity, network uncertainty and security demands whilst considering different applications{\textquoteright} requirement and appliances{\textquoteright} capabilities. Our practical experiences show that the proposed parallelized orchestrator can reduce the execution time by 50\% with at least 30\% higher orchestration quality. We believe that our solution plays an important role in the current Fog ecosystem.",
    year = "2020",
    month = "March",
    day = "1",
    language = "English",
    isbn = "1119501091",
    pages = "179--212",
    booktitle = "Fog and Fogonomics",
    publisher = "Wiley",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/250105360/Fog\_BookChapter\_final\_v1.0\_20181028.pdf"
    }


  • [19] P. Terenius, L. G. Golmen, P. Garraghan, and R. H. R. Harper, “Heat energy from datacenters: an opportunity for marine energy.” 2020.
    [BibTeX] [Abstract]

    The world as we know it faces a severe threat from global warming. In November 2019, 11,000 scientists signed a warning on the effects of climate change, stating that the world “must quickly implement massive energy efficiency and conservation practices and must replace fossil fuels with low-carbon renewables”. Vital to our argument is that these words relate both to energy-saving measures and to energy production. This paper shows how a large actor in energy consumption – datacenters – can work together with a promising renewable marine energy technology – ocean thermal energy conversion (OTEC) – to decrease the CO2 footprint of mankind while enabling sustainable growth.

    @conference{aee9be9bcad240e898b0feaa30226293,
    author = "Terenius, Petter and Golmen, {Lars G} and Garraghan, Peter and Harper, R.H.R.",
    title = "Heat energy from datacenters: an opportunity for marine energy",
    abstract = "The world as we know it faces a severe threat from global warming. In November 2019, 11,000 scientists signed a warning on the effects of climate change, stating that the world “must quickly implement massive energy efficiency and conservation practices and must replace fossil fuels with low-carbon renewables”. Vital to our argument is that these words relate to both energy-saving measures and to energy production.This paper shows how a large actor in energy consumption – datacenters – can work together with a promising technology in renewable, marine, energy – ocean thermal energy conversion (OTEC) technology – to decrease the CO2 footprint of mankind while enabling sustainable growth.",
    keywords = "Data centres, Marine energy, Energy, OTEC, ocean thermal energy conversion",
    year = "2020",
    month = "February",
    day = "26",
    language = "English",
    pdf = ""
    }


  • [20] R. Yang, X. Sun, C. Hu, P. Garraghan, T. Wo, Z. Wen, H. Peng, J. Xu, and C. Li, “Performance-aware Speculative Resource Oversubscription for Large-scale Clusters,” IEEE Transactions on Parallel and Distributed Systems, vol. 31, iss. 7, p. 1499–1517, 2020. doi:10.1109/TPDS.2020.2970013
    [BibTeX] [Abstract] [pdf][Download PDF]

    It is a long-standing challenge to achieve a high degree of resource utilization in cluster scheduling. Resource oversubscription has become a common practice for improving resource utilization and reducing cost. However, current centralized approaches to oversubscription suffer from resource mismatch and fail to take into account other performance requirements, e.g., tail latency. In this paper we present ROSE, a new resource management platform capable of conducting performance-aware resource oversubscription. ROSE allows latency-sensitive long-running applications (LRAs) to co-exist with computation-intensive batch jobs. Instead of waiting for resource allocation to be confirmed by the centralized scheduler, job managers in ROSE can independently request to launch speculative tasks within specific machines according to their suitability for oversubscription. Node agents of those machines can, however, avoid any excessive resource oversubscription by means of a mechanism for admission control using multi-resource threshold control and performance-aware resource throttling. Experiments show that in the case of mixed co-location of batch jobs and latency-sensitive LRAs, the CPU utilization and the disk utilization can reach 56.34\% and 43.49\%, respectively, while the 95th percentile of read latency in YCSB workloads only increases by 5.4\% against the case of executing the LRAs alone.

    @article{877e6f882b954a11a763062e83be3dbe,
    author = "Yang, Renyu and Sun, Xiaoyang and Hu, Chunming and Garraghan, Peter and Wo, Tianyu and Wen, Zhenyu and Peng, Hao and Xu, Jie and Li, Chao",
    title = "Performance-aware Speculative Resource Oversubscription for Large-scale Clusters",
    abstract = "It is a long-standing challenge to achieve a high degree of resource utilization in cluster scheduling. Resource oversubscription has become a common practice in improving resource utilization and cost reduction. However, current centralizedapproaches to oversubscription suffer from the issue with resource mismatch and fail to take into account other performance requirements, e.g., tail latency. In this paper we present ROSE, a new resource management platform capable of conducting performance-aware resource oversubscription. ROSE allows latency-sensitive long-running applications (LRAs) to co-exist with computation-intensive batch jobs. Instead of waiting for resource allocation to be confirmed by the centralized scheduler, job managers in ROSE can independently request to launch speculative tasks within specific machines according to their suitability for oversubscription. Node agents of those machines can however avoid any excessive resource oversubscription by means of a mechanism for admission control using multi-resource threshold control and performance-aware resource throttle. Experiments show that in case of mixed co-location of batch jobs and latency-sensitive LRAs, the CPU utilization and the disk utilization can reach56.34\% and 43.49\%, respectively, but the 95th percentile of read latency in YCSB workloads only increases by 5.4\% against the case of executing the LRAs alone.",
    keywords = "Resource scheduling, Oversubscription, Cluster utilization, Resource throttling, QoS",
    note = "{\textcopyright}2020 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.",
    year = "2020",
    month = "January",
    day = "28",
    doi = "10.1109/TPDS.2020.2970013",
    language = "English",
    volume = "31",
    pages = "1499--1517",
    journal = "IEEE Transactions on Parallel and Distributed Systems",
    issn = "1045-9219",
    publisher = "IEEE Computer Society",
    number = "7",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/287859274/tpds2020\_rose.pdf"
    }


2019

  • [21] D. Lindsay, S. Gill, and P. Garraghan, “PRISM: An Experiment Framework for Straggler Analytics in Containerized Clusters,” in WoC 2019 Fifth International Workshop on Container Technologies and Container Clouds, 2019, p. 13–18. doi:10.1145/3366615.3368353
    [BibTeX] [Abstract] [pdf][Download PDF]

    Containerized clusters of machines at scale that provision Cloud services are encountering substantive difficulties with stragglers – whereby a small subset of task executions negatively degrades system performance. Stragglers are an unsolved challenge due to a wide variety of root-causes and stochastic behavior. While there have been efforts to mitigate their effects, few works have attempted to empirically ascertain how system operational scenarios precisely influence straggler occurrence and severity. This challenge is further compounded by the difficulty of conducting experiments within real-world containerized clusters. System maintenance and experiment design are often error-prone and time-consuming processes, and a large portion of tools created for workload submission and straggler injection are bespoke to specific clusters, limiting experiment reproducibility. In this paper we propose PRISM, a framework that automates containerized cluster setup, experiment design, and experiment execution. Our framework is capable of deployment, configuration, execution, performance trace transformation and aggregation of containerized application frameworks, enabling scripted execution of diverse workloads and cluster configurations. The framework reduces the time required for cluster setup and experiment execution from hours to minutes. We use PRISM to conduct automated experimentation of system operational conditions and identify that straggler manifestation is affected by resource contention, input data size and scheduler architecture limitations.

    @inproceedings{dc5d81c680f74916a39de57ec5435d53,
    author = "Lindsay, Dominic and Gill, Sukhpal and Garraghan, Peter",
    title = "PRISM: An Experiment Framework for Straggler Analytics in Containerized Clusters",
    abstract = "Containerized clusters of machines at scale that provision Cloud services are encountering substantive difficulties with stragglers -- whereby a small subset of task execution negatively degrades system performance. Stragglers are an unsolved challenge due to a wide variety of root-causes and stochastic behavior. While there have been efforts to mitigate their effects, few works have attempted to empirically ascertain how system operational scenarios precisely influence straggler occurrence and severity. This challenge is further compounded with the difficulties of conducting experiments within real-world containerized clusters. System maintenance and experiment design are often error-prone and time-consuming processes, and a large portion of tools created for workload submission and straggler injection are bespoke to specific clusters, limiting experiment reproducibility. In this paper we propose PRISM, a framework that automates containerized cluster setup, experiment design, and experiment execution. Our framework is capable of deployment, configuration, execution, performance trace transformation and aggregation of containerized application frameworks, enabling scripted execution of diverse workloads and cluster configurations. The framework reduces time required for cluster setup and experiment execution from hours to minutes. We use PRISM to conduct automated experimentation of system operational conditions and identify straggler manifestation is affected by resource contention, input data size and scheduler architecture limitations.",
    keywords = "Straggler, Containers, Datacenters, Clusters",
    note = "{\textcopyright} ACM, 2019. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in WOC '19: Proceedings of the 5th International Workshop on Container Technologies and Container Clouds 2019 https://dl.acm.org/doi/abs/10.1145/3366615.3368353",
    year = "2019",
    month = "December",
    day = "1",
    doi = "10.1145/3366615.3368353",
    language = "English",
    isbn = "9781450370332",
    pages = "13--18",
    booktitle = "WoC 2019 Fifth International Workshop on Container Technologies and Container Clouds",
    publisher = "ACM",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/276645674/SUBMITTED\_2019\_PRISM\_An\_Experiment\_Framework\_for\_Straggler\_Analytics\_in\_Containerized\_Clusters.pdf"
    }


  • [22] S. Gill, S. Tuli, M. Xu, I. Singh, K. Singh, D. Lindsay, S. Tuli, D. Smirnova, M. Singh, U. Jain, H. Pervaiz, B. Sehgal, S. S. Kaila, S. Misra, M. S. Aslanpour, H. Mehta, V. Stankovski, and P. Garraghan, “Transformative Effects of IoT, Blockchain and Artificial Intelligence on Cloud Computing: Evolution, Vision, Trends and Open Challenges,” Internet of Things, vol. 8, 2019. doi:10.1016/j.iot.2019.100118
    [BibTeX] [Abstract] [pdf][Download PDF]

    Cloud computing plays a critical role in modern society and enables a range of applications from infrastructure to social media. Such systems must cope with varying load and evolving usage, reflecting society's interaction with and dependency on automated computing systems, whilst satisfying Quality of Service (QoS) guarantees. Enabling these systems is a cohort of conceptual technologies, synthesized to meet the demands of evolving computing applications. In order to understand the current and future challenges of such systems, there is a need to identify the key technologies enabling future applications. In this study, we aim to explore how three emerging paradigms (Blockchain, IoT and Artificial Intelligence) will influence future cloud computing systems. Further, we identify several technologies driving these paradigms and invite international experts to discuss the current status and future directions of cloud computing. Finally, we propose a conceptual model for cloud futurology to explore the influence of emerging paradigms and technologies on the evolution of cloud computing.

    @article{d481598e06d64c118ea894819f5b5541,
    author = "Gill, Sukhpal and Tuli, Shreshth and Xu, Minxian and Singh, Inderpreet and Singh, Karan and Lindsay, Dominic and Tuli, Shikhar and Smirnova, Daria and Singh, Manmeet and Jain, Udit and Pervaiz, Haris and Sehgal, Bhanu and Kaila, {Sukhwinder Singh} and Misra, Sanjay and Aslanpour, {Mohammad Sadegh} and Mehta, Harshit and Stankovski, Vlado and Garraghan, Peter",
    title = "Transformative Effects of IoT, Blockchain and Artificial Intelligence on Cloud Computing: Evolution, Vision, Trends and Open Challenges",
    abstract = "Cloud computing plays a critical role in modern society and enables a range of applications from infrastructure to social media. Such system must cope with varying load and evolving usage reflecting societies{\textquoteright} interaction and dependency on automated computing systems whilst satisfying Quality of Service (QoS) guarantees. Enabling these systems are a cohort of conceptual technologies, synthesized to meet demand of evolving computing applications. In order to understand current and future challenges of such system, there is a need to identify key technologies enabling future applications. In this study, we aim to explore how three emerging paradigms (Blockchain, IoT and Artificial Intelligence) will influence future cloud computing systems. Further, we identify several technologies driving these paradigms and invite international experts to discuss the current status and future directions of cloud computing. Finally, we proposed a conceptual model for cloud futurology to explore the influence of emerging paradigms and technologies on evolution of cloud computing.",
    keywords = "Internet of Things (IoT), Cloud computing, Artificial Intelligence, Blockchain",
    note = "This is the author{\textquoteright}s version of a work that was accepted for publication in Internet of Things. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Internet of Things, 8, 2019 DOI: 10.1016/j.iot.2019.100118",
    year = "2019",
    month = "December",
    day = "1",
    doi = "10.1016/j.iot.2019.100118",
    language = "English",
    volume = "8",
    journal = "Internet of Things",
    issn = "2542-6605",
    publisher = "Elsevier",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/278426613/Tranformative\_IoT.pdf"
    }


  • [23] S. S. Gill, P. Garraghan, V. Stankovski, G. Casale, R. K. Thulasiram, S. K. Ghosh, K. Ramamohanarao, and R. Buyya, “Holistic Resource Management for Sustainable and Reliable Cloud Computing: An Innovative Solution to Global Challenge,” Journal of Systems and Software, vol. 155, p. 104–129, 2019. doi:10.1016/j.jss.2019.05.025
    [BibTeX] [Abstract] [pdf][Download PDF]

    Minimizing the energy consumption of servers within cloud computing systems is of utmost importance to cloud providers towards reducing operational costs and enhancing service sustainability by consolidating services onto fewer active servers. Moreover, providers must also provision high levels of availability and reliability; hence cloud services are frequently replicated across servers, which subsequently increases server energy consumption and resource overhead. These two objectives present a potential conflict within cloud resource management decision making, which must balance between service consolidation and replication to minimize energy consumption whilst maximizing server availability and reliability, respectively. In this paper, we propose a cuckoo optimization-based energy-reliability aware resource scheduling technique (CRUZE) for holistic management of cloud computing resources including servers, networks, storage, and cooling systems. CRUZE clusters and executes heterogeneous workloads on provisioned cloud resources, enhancing energy-efficiency and reducing the carbon footprint in datacenters without adversely affecting cloud service reliability. We evaluate the effectiveness of CRUZE against existing state-of-the-art solutions using the CloudSim toolkit. Results indicate that our proposed technique is capable of reducing energy consumption by 20.1\% whilst improving reliability and CPU utilization by 17.1\% and 15.7\% respectively, without affecting other Quality of Service parameters.

    @article{75d78e7442d043cc9be35bca557a1f81,
    author = "Gill, {Sukhpal Singh} and Garraghan, Peter and Stankovski, Vlado and Casale, Giuliano and Thulasiram, {Ruppa K.} and Ghosh, {Soumya K.} and Ramamohanarao, Kotagiri and Buyya, Rajkumar",
    title = "Holistic Resource Management for Sustainable and Reliable Cloud Computing: An Innovative Solution to Global Challenge",
    abstract = "Minimizing the energy consumption of servers within cloud computing systems is of upmost importance to cloud providers towards reducing operational costs and enhancing service sustainability by consolidating services onto fewer active servers. Moreover, providers must also provision high levels of availability and reliability, hence cloud services are frequently replicated across servers that subsequently increases server energy consumption and resource overhead. These two objectives can present a potential conflict within cloud resource management decision making that must balance between service consolidation and replication to minimize energy consumption whilst maximizing server availability and reliability, respectively. In this paper, we propose a cuckoo optimization-based energy-reliability aware resource scheduling technique (CRUZE) for holistic management of cloud computing resources including servers, networks, storage, and cooling systems. CRUZE clusters and executes heterogeneous workloads on provisioned cloud resources and enhances the energy-efficiency and reduces the carbon footprint in datacenters without adversely affecting cloud service reliability. We evaluate the effectiveness of CRUZE against existing state-of-the-art solutions using the CloudSim toolkit. Results indicate that our proposed technique is capable of reducing energy consumption by 20.1\% whilst improving reliability and CPU utilization by 17.1\% and 15.7\% respectively without affecting other Quality of Service parameters.",
    keywords = "Cloud Computing, Energy Consumption, Sustainability, Reliability, Holistic Management, Cloud Datacenters",
    note = "This is the author{\textquoteright}s version of a work that was accepted for publication in Journal of Systems and Software. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Journal of Systems and Software, 155, 2019 DOI: 10.1016/j.jss.2019.05.025",
    year = "2019",
    month = "September",
    day = "1",
    doi = "10.1016/j.jss.2019.05.025",
    language = "English",
    volume = "155",
    pages = "104--129",
    journal = "Journal of Systems and Software",
    issn = "0164-1212",
    publisher = "Elsevier Inc.",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/263698565/Holistic\_Resource\_Management\_for\_Sustainable\_and\_Reliable\_Cloud\_Computing.pdf"
    }


  • [24] S. S. Gill, P. Garraghan, and R. Buyya, “ROUTER: Fog Enabled Cloud based Intelligent Resource Management Approach for Smart Home IoT Devices,” Journal of Systems and Software, vol. 154, p. 125–138, 2019. doi:10.1016/j.jss.2019.04.058
    [BibTeX] [Abstract] [pdf][Download PDF]

    There is a growing requirement for Internet of Things (IoT) infrastructure to ensure low response time when provisioning latency-sensitive real-time applications such as health monitoring, disaster management, and smart homes. Fog computing offers a means to meet such requirements, via a virtualized intermediate layer that provides data, computation, storage, and networking services between Cloud datacenters and end users. A key element within such Fog computing environments is resource management. While resource managers for Fog computing exist, they only focus on a subset of the parameters important to Fog resource management, encompassing system response time, network bandwidth, energy consumption and latency. To date, no existing Fog resource manager considers these parameters simultaneously for decision making, which in the context of smart homes will become increasingly key. In this paper, we propose a novel resource management technique (ROUTER) for fog-enabled Cloud computing environments, which leverages Particle Swarm Optimization to optimize these parameters simultaneously. The approach is validated within an IoT-based smart home automation scenario, and evaluated within the iFogSim toolkit driven by empirical models from a small-scale smart home experiment. Results demonstrate that our approach achieves a reduction of 12\% in network bandwidth, 10\% in response time, 14\% in latency and 12.35\% in energy consumption.

    @article{bfcf32901b00411a92ccf9009fcc6853,
    author = "Gill, {Sukhpal Singh} and Garraghan, Peter and Buyya, Rajkumar",
    title = "ROUTER: Fog Enabled Cloud based Intelligent Resource Management Approach for Smart Home IoT Devices",
    abstract = "There is a growing requirement for Internet of Things (IoT) infrastructure to ensure low response time to provision latency-sensitive real-time applications such as health monitoring, disaster management, and smart homes. Fog computing offers a means to provide such requirements, via a virtualized intermediate layer to provide data, computation, storage, and networking services between Cloud datacenters and end users. A key element within such Fog computing environments is resource management. While there are existing resource manager in Fog computing, they only focus on a subset of parameters important to Fog resource management encompassing system response time, network bandwidth, energy consumption and latency. To date no existing Fog resource manager considers these parameters simultaneously for decision making, which in the context of smart homes will become increasingly key. In this paper, we propose a novel resource management technique (ROUTER) for fog-enabled Cloud computing environments, which leverages Particle Swarm Optimization to optimize simultaneously. The approach is validated within an IoT-based smart home automation scenario, and evaluated within iFogSim toolkit driven by empirical models within a small-scale smart home experiment. Results demonstrate our approach results a reduction of 12\% network bandwidth, 10\% response time, 14\% latency and 12.35\% in energy consumption.",
    keywords = "Fog Computing, Cloud Computing, Internet of Things, Smart Home, Resource Management, Edge Computing",
    note = "This is the author{\textquoteright}s version of a work that was accepted for publication in Journal of Systems and Software. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Journal of Systems and Software, 154, 2019 DOI: 10.1016/j.jss.2019.04.058",
    year = "2019",
    month = "August",
    day = "1",
    doi = "10.1016/j.jss.2019.04.058",
    language = "English",
    volume = "154",
    pages = "125--138",
    journal = "Journal of Systems and Software",
    issn = "0164-1212",
    publisher = "Elsevier Inc.",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/261421766/1\_s2.0\_S0164121219300986\_main.pdf"
    }


  • [25] P. Garraghan, X. Ouyang, R. Yang, D. McKee, and J. Xu, “Straggler root-cause and impact analysis for massive-scale virtualized cloud datacenters,” IEEE Transactions on Services Computing, vol. 12, iss. 1, p. 91–104, 2019. doi:10.1109/TSC.2016.2611578
    [BibTeX] [Abstract] [pdf][Download PDF]

    The increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena substantially affecting overall system performance. This phenomenon is known as the “Long Tail”, whereby a small proportion of task stragglers significantly impede job completion time. While prior work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root-cause and quantifies its impact upon system operation. Such analysis is critical to ascertain in-depth knowledge of straggler occurrence for focusing developmental and research efforts towards solving the Long Tail challenge. This paper provides an empirical analysis of straggler root-cause within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency and impact stragglers impose, and propose a method for conducting root-cause analysis. Results demonstrate that approximately 5\% of task stragglers impact 50\% of total jobs for batch processes, and 53\% of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection through a combination of offline execution pattern modeling and online analytic agents that monitor tasks at runtime. Experiments show the approach is capable of detecting stragglers less than 11\% into their execution lifecycle with 95\% accuracy for short duration jobs.

    @article{6d6eda9a4aed443891e68b4ee023be3f,
    author = "Garraghan, Peter and Ouyang, Xue and Yang, Renyu and McKee, David and Xu, Jie",
    title = "Straggler root-cause and impact analysis for massive-scale virtualized cloud datacenters",
    abstract = "Increased complexity and scale of virtualized distributed systems has resulted in the manifestation of emergent phenomena substantially affecting overall system performance. This phenomena is known as “Long Tail”, whereby a small proportion of task stragglers significantly impede job completion time. While work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root-cause and quantifies its impact upon system operation. Such analysis is critical to ascertain in-depth knowledge of straggler occurrence for focusing developmental and research efforts towards solving the Long Tail challenge. This paper provides an empirical analysis of straggler root-cause within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency and impact stragglers impose, and propose a method for conducting root-cause analysis. Results demonstrate approximately 5\% of task stragglers impact 50\% of total jobs for batch processes, and 53\% of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection through a combination of offline execution patterns modeling and online analytic agents to monitor tasks at runtime. Experiments show the approach is capable of detecting stragglers less than 11\% into their execution lifecycle with 95\% accuracy for short duration jobs.",
    keywords = "Cloud computing, Straggler, Distributed Systems, Root-cause analysis, Datacenter",
    note = "{\textcopyright} 2019 IEEE. This is an author produced version of a paper published in IEEE Transactions on Services Computing. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Uploaded in accordance with the publisher's self-archiving policy.",
    year = "2019",
    month = "January",
    day = "1",
    doi = "10.1109/TSC.2016.2611578",
    language = "English",
    volume = "12",
    pages = "91--104",
    journal = "IEEE Transactions on Services Computing",
    issn = "1939-1374",
    publisher = "Institute of Electrical and Electronics Engineers",
    number = "1",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/134463719/tsc2016b.pdf"
    }


2018

  • [26] P. Garraghan, R. Yang, Z. Wen, A. Romanovsky, J. Xu, R. Buyya, and R. Ranjan, “Emergent Failures: Rethinking Cloud Reliability at Scale,” IEEE Cloud Computing, vol. 5, iss. 5, p. 12–21, 2018. doi:10.1109/MCC.2018.053711662
    [BibTeX] [Abstract] [pdf][Download PDF]

    Since the conception of cloud computing, ensuring its ability to provide highly reliable service has been of the utmost importance and criticality to the business objectives of providers and their customers. This has held true for every facet of the system, encompassing applications, resource management, the underlying computing infrastructure, and environmental cooling. Thus, the cloud-computing and dependability research communities have exerted considerable effort toward enhancing the reliability of system components against various software and hardware failures. However, as these systems have continued to grow in scale, with heterogeneity and complexity resulting in the manifestation of emergent behavior, so too have their respective failures. Recent studies of production cloud datacenters indicate the existence of complex failure manifestations that existing fault tolerance and recovery strategies are ill-equipped to effectively handle. These strategies can even be responsible for such failures. These emergent failures – frequently transient and identifiable only at runtime – represent a significant threat to designing reliable cloud systems. This article identifies the challenges of emergent failures in cloud datacenters at scale and their impact on system resource management, and discusses potential directions of further study for Internet of Things integration and holistic fault tolerance.

    @article{a790999654954f8e8905cfe79abaa7e3,
    author = "Garraghan, Peter and Yang, Renyu and Wen, Zhenyu and Romanovsky, Alexander and Xu, Jie and Buyya, Rajkumar and Ranjan, Rajiv",
    title = "Emergent Failures: Rethinking Cloud Reliability at Scale",
    abstract = "Since the conception of cloud computing, ensuring its ability to provide highly reliable service has been of the upmost importance and criticality to the business objectives of providers and their customers. This has held true for every facet of the system, encompassing applications, resource management, the underlying computing infrastructure, and environmental cooling. Thus, the cloud-computing and dependability research communities have exerted considerable effort toward enhancing the reliability of system components against various software and hardware failures. However, as these systems have continued to grow in scale, with heterogeneity and complexity resulting in the manifestation of emergent behavior, so too have their respective failures. Recent studies of production cloud datacenters indicate the existence of complex failure manifestations that existing fault tolerance and recovery strategies are ill-equipped to effectively handle. These strategies can even be responsible for such failures. These emergent failures-frequently transient and identifiable only at runtime-represent a significant threat to designing reliable cloud systems. This article identifies the challenges of emergent failures in cloud datacenters at scale and their impact on system resource management, and discusses potential directions of further study for Internet of Things integration and holistic fault tolerance.",
    note = "{\textcopyright}2018 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.",
    year = "2018",
    month = "October",
    day = "18",
    doi = "10.1109/MCC.2018.053711662",
    language = "English",
    volume = "5",
    pages = "12--21",
    journal = "IEEE Cloud Computing",
    issn = "2325-6095",
    publisher = "IEEE",
    number = "5",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/249250830/Emergent\_Failures.pdf"
    }


  • [27] A. Saeed, P. Garraghan, B. Craggs, D. {van der Linden}, A. Rashid, and S. {Asad Hussain}, “A Cross-Virtual Machine Network Channel Attack via Mirroring and TAP Impersonation,” in 2018 IEEE International Conference on Cloud Computing (CLOUD), 2018, p. 606–613. doi:10.1109/CLOUD.2018.00084
    [BibTeX] [Abstract] [pdf][Download PDF]

    Data privacy and security is a leading concern for providers and customers of cloud computing, where Virtual Machines (VMs) can co-reside within the same underlying physical machine. Side channel attacks within multi-tenant virtualized cloud environments are an established problem, where attackers are able to monitor and exfiltrate data from co-resident VMs. Virtualization services have attempted to mitigate such attacks by preventing VM-to-VM interference on shared hardware by providing logical resource isolation between co-located VMs via an internal virtual network. However, such approaches are also insecure, with attackers capable of performing network channel attacks which bypass mitigation strategies using vectors such as ARP Spoofing, TCP/IP steganography, and DNS poisoning. In this paper we identify a new vulnerability within the internal cloud virtual network, showing that through a combination of TAP impersonation and mirroring, a malicious VM can successfully redirect and monitor network traffic of VMs co-located within the same physical machine. We demonstrate the feasibility of this attack in a prominent cloud platform – OpenStack – under various security requirements and system conditions, and propose countermeasures for mitigation.

    @inproceedings{021824af97724c1ea68ad219ffb0b1e9,
    author = "Saeed, Atif and Garraghan, Peter and Craggs, Barnaby and {van der Linden}, Dirk and Rashid, Awais and {Asad Hussain}, Syed",
    title = "A Cross-Virtual Machine Network Channel Attack via Mirroring and TAP Impersonation",
    abstract = "Data privacy and security is a leading concern for providers and customers of cloud computing, where Virtual Machines (VMs) can co-reside within the same underlying physical machine. Side channel attacks within multi-tenant virtualized cloud environments are an established problem, where attackers are able to monitor and exfiltrate data from co-resident VMs. Virtualization services have attempted to mitigate such attacks by preventing VM-to-VM interference on shared hardware by providing logical resource isolation between co-located VMs via an internal virtual network. However, such approaches are also insecure, with attackers capable of performing network channel attacks which bypass mitigation strategies using vectors such as ARP Spoofing, TCP/IP steganography, and DNS poisoning. In this paper we identify a new vulnerability within the internal cloud virtual network, showing that through a combination of TAP impersonation and mirroring, a malicious VM can successfully redirect and monitor network traffic of VMs co-located within the same physical machine. We demonstrate the feasibility of this attack in a prominent cloud platform – OpenStack – under various security requirements and system conditions, and propose countermeasures for mitigation.",
    keywords = "Cloud Computing, Channel Attack, Security",
    note = "{\textcopyright}2018 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.",
    year = "2018",
    month = "July",
    day = "2",
    doi = "10.1109/CLOUD.2018.00084",
    language = "English",
    pages = "606--613",
    booktitle = "2018 IEEE International Conference on Cloud Computing (CLOUD)",
    publisher = "IEEE",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/238600181/Cloud\_Cross\_VM\_Attack.pdf"
    }


  • [28] X. Sun, C. Hu, R. Yang, P. Garraghan, T. Wo, J. Xu, J. Zhu, and C. Li, “ROSE: Cluster Resource Scheduling via Speculative Over-subscription,” in 38th IEEE International Conference on Distributed Computing Systems (ICDCS), 2018, p. 949–960. doi:10.1109/ICDCS.2018.00096
    [BibTeX] [Abstract] [pdf][Download PDF]

    A long-standing challenge in cluster scheduling is to achieve a high degree of utilization of heterogeneous resources in a cluster. In practice there exists a substantial disparity between perceived and actual resource utilization. A scheduler might regard a cluster as fully utilized if a large resource request queue is present, but the actual resource utilization of the cluster can in fact be very low. This disparity results in the formation of idle resources, leading to inefficient resource usage, high operational costs and an inability to provision services. In this paper we present a new cluster scheduling system, ROSE, that is based on a multi-layered scheduling architecture with an ability to over-subscribe idle resources to accommodate unfulfilled resource requests. ROSE books idle resources in a speculative manner: instead of waiting for resource allocation to be confirmed by the centralized scheduler, it intelligently requests to launch tasks within machines according to their suitability to oversubscribe resources. A threshold control with timely task rescheduling ensures fully-utilized cluster resources without generating potential task stragglers. Experimental results show that ROSE can almost double the average CPU utilization, from 36.37\% to 65.10\%, compared with a centralized scheduling scheme, and reduce the workload makespan by 30.11\%, with an 8.23\% disk utilization improvement over other scheduling strategies.

    @inproceedings{3f23d24ff7ee45e4b30502cc6fc06e15,
    author = "Sun, Xiaoyang and Hu, Chunming and Yang, Renyu and Garraghan, Peter and Wo, Tianyu and Xu, Jie and Zhu, Jianyong and Li, Chao",
    title = "ROSE: Cluster Resource Scheduling via Speculative Over-subscription",
    abstract = "A long-standing challenge in cluster scheduling is to achieve a high degree of utilization of heterogeneous resources in a cluster. In practice there exists a substantial disparity between perceived and actual resource utilization. A scheduler might regard a cluster as fully utilized if a large resource request queue is present, but the actual resource utilization of the cluster can be in fact very low. This disparity results in the formation of idle resources, leading to inefficient resource usage and incurring high operational costs and an inability to provision services. In this paper we present a new cluster scheduling system, ROSE, that is based on a multi-layered scheduling architecture with an ability to over-subscribe idle resources to accommodate unfulfilled resource requests. ROSE books idle resources in a speculative manner:instead of waiting for resource allocation to be confirmed by the centralized scheduler,it requests intelligently to launch tasks within machines according to their suitability to oversubscribe resources. A threshold control with timely task rescheduling ensures fully-utilized cluster resources without generating potential tasks tragglers. Experimental results show that ROSE can almost double the average CPU utilization, from 36.37\% to 65.10\%, compared with a centralized scheduling scheme, and reduce the workload makespan by 30.11\%, with an 8.23\% disk utilization improvement over other scheduling strategies.",
    year = "2018",
    month = "July",
    day = "2",
    doi = "10.1109/ICDCS.2018.00096",
    language = "English",
    series = "2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS)",
    publisher = "IEEE",
    pages = "949--960",
    booktitle = "38th IEEE International Conference on Distributed Computing Systems (ICDCS)",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/237760565/ICDCS\_Camera\_Ready.pdf"
    }
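
    The speculative over-subscription idea in [28] can be illustrated with a short sketch: extra work is launched against a machine's actual usage rather than its nominal allocation, guarded by a safety margin. This is only a sketch of the concept, not the ROSE implementation; the class, function names and the 0.9 threshold are illustrative assumptions.

    from dataclasses import dataclass

    @dataclass
    class Machine:
        cpu_capacity: float   # total cores on the machine
        cpu_allocated: float  # cores promised by the central scheduler
        cpu_used: float       # cores actually consumed right now

    def can_oversubscribe(m: Machine, task_cpu: float, safety: float = 0.9) -> bool:
        # Launch speculatively if real usage (not allocation) leaves headroom
        # below the assumed safety threshold.
        return m.cpu_used + task_cpu <= safety * m.cpu_capacity

    machines = [Machine(32, 32, 10.5), Machine(32, 30, 29.0)]
    task_demand = 4.0
    candidates = [m for m in machines if can_oversubscribe(m, task_demand)]
    # Prefer the machine with the largest gap between usage and capacity.
    best = max(candidates, key=lambda m: m.cpu_capacity - m.cpu_used, default=None)
    print("oversubscribe on:", best)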


  • [29] X. Li, P. Garraghan, X. Jiang, Z. Wu, and J. Xu, “Holistic virtual machine scheduling in cloud datacenters towards minimizing total energy,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, iss. 6, p. 1317–1331, 2018. doi:10.1109/TPDS.2017.2688445
    [BibTeX] [Abstract] [pdf][Download PDF]

    Energy consumed by Cloud datacenters has dramatically increased, driven by rapid uptake of applications and services globally provisioned through virtualization. By applying energy-aware virtual machine scheduling, Cloud providers are able to achieve enhanced energy efficiency and reduced operation cost. Energy consumption of datacenters consists of computing energy and cooling energy. However, due to the complexity of energy and thermal modeling of realistic Cloud datacenter operation, traditional approaches are unable to provide a comprehensive in-depth solution for virtual machine scheduling which encompasses both computing and cooling energy. This paper addresses this challenge by presenting an elaborate thermal model that analyzes the temperature distribution of airflow and server CPU. We propose GRANITE – a holistic virtual machine scheduling algorithm capable of minimizing total datacenter energy consumption. The algorithm is evaluated against other existing workload scheduling algorithms MaxUtil, TASA, IQR and Random using real Cloud workload characteristics extracted from Google datacenter tracelog. Results demonstrate that GRANITE consumes 4.3\% – 43.6\% less total energy in comparison to the state-of-the-art, and reduces the probability of critical temperature violation by 99.2\% with 0.17\% SLA violation rate as the performance penalty.

    @article{5b8b132c7a7747f3b2cb8ac63b9ca8ff,
    author = "Li, Xiang and Garraghan, Peter and Jiang, Xiaohong and Wu, Zhaohui and Xu, Jie",
    title = "Holistic virtual machine scheduling in cloud datacenters towards minimizing total energy",
    abstract = "Energy consumed by Cloud datacenters has dramatically increased, driven by rapid uptake of applications and services globally provisioned through virtualization. By applying energy-aware virtual machine scheduling, Cloud providers are able to achieve enhanced energy efficiency and reduced operation cost. Energy consumption of datacenters consists of computing energy and cooling energy. However, due to the complexity of energy and thermal modeling of realistic Cloud datacenter operation, traditional approaches are unable to provide a comprehensive in-depth solution for virtual machine scheduling which encompasses both computing and cooling energy. This paper addresses this challenge by presenting an elaborate thermal model that analyzes the temperature distribution of airflow and server CPU. We propose GRANITE – a holistic virtual machine scheduling algorithm capable of minimizing total datacenter energy consumption. The algorithm is evaluated against other existing workload scheduling algorithms MaxUtil, TASA, IQR and Random using real Cloud workload characteristics extracted from Google datacenter tracelog. Results demonstrate that GRANITE consumes 4.3\% - 43.6\% less total energy in comparison to the state-of-the-art, and reduces the probability of critical temperature violation by 99.2\% with 0.17\% SLA violation rate as the performance penalty.",
    keywords = "Cloud computing, energy efficiency, datacenter modeling, workload scheduling, virtual machine",
    note = "{\textcopyright}2017 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.",
    year = "2018",
    month = "June",
    day = "1",
    doi = "10.1109/TPDS.2017.2688445",
    language = "English",
    volume = "29",
    pages = "1317--1331",
    journal = "IEEE Transactions on Parallel and Distributed Systems",
    issn = "1045-9219",
    publisher = "IEEE Computer Society",
    number = "6",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/161571104/Holistic\_Virtual\_Machine\_Scheduling\_in\_Cloud\_Datacenters\_towards\_Minimizing\_Total\_Energy\_Accepted\_.pdf"
    }
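
    As a rough illustration of scoring computing and cooling energy together when placing a VM, which is the core idea in [29], the sketch below ranks candidate hosts by IT power plus a cooling estimate. The linear utilization-to-power model and the coefficient-of-performance curve are generic forms commonly used in the thermal-aware scheduling literature, assumed here for illustration rather than taken from the GRANITE paper.

    def server_power(util, p_idle=100.0, p_max=250.0):
        # Linear CPU-utilization power model (watts); coefficients are placeholders.
        return p_idle + (p_max - p_idle) * util

    def cooling_power(it_watts, supply_temp_c):
        # Coefficient-of-performance (CoP) curve often cited for chilled-water cooling.
        cop = 0.0068 * supply_temp_c ** 2 + 0.0008 * supply_temp_c + 0.458
        return it_watts / cop

    def power_after_placement(host_util, vm_util, supply_temp_c=20.0):
        it = server_power(min(host_util + vm_util, 1.0))
        return it + cooling_power(it, supply_temp_c)

    hosts = {"h1": 0.30, "h2": 0.65, "h3": 0.85}  # current CPU utilization
    vm_demand = 0.20
    best = min(hosts, key=lambda h: power_after_placement(hosts[h], vm_demand))
    print("place VM on", best)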


  • [30] X. Ouyang, P. Garraghan, B. Primas, D. McKee, P. Townend, and J. Xu, “Adaptive Speculation for Efficient Internetware Application Execution in Clouds,” ACM Transactions on Internet Technology, vol. 18, iss. 2, 2018. doi:10.1145/3093896
    [BibTeX] [Abstract] [pdf][Download PDF]

    Modern Cloud computing systems are massive in scale, featuring environments that can execute highly dynamic Internetware applications with huge numbers of interacting tasks. This has led to a substantial challenge−the straggler problem, whereby a small subset of slow tasks significantly impedes parallel job completion. This problem results in longer service responses, degraded system performance, and late timing failures that can easily threaten Quality of Service (QoS) compliance. Speculative execution (or speculation) is the prominent method deployed in Clouds to tolerate stragglers by creating task replicas at runtime. The method detects stragglers by specifying a predefined threshold to calculate the difference between individual tasks and the average task progression within a job. However, such a static threshold debilitates speculation effectiveness as it fails to capture the intrinsic diversity of timing constraints in Internetware applications, as well as dynamic environmental factors such as resource utilization. By considering such characteristics, different levels of strictness for replica creation can be imposed to adaptively achieve specified levels of QoS for different applications. In this paper we present an algorithm to improve the execution efficiency of Internetware applications by dynamically calculating the straggler threshold, considering key parameters including job QoS timing constraints, task execution progress, and optimal system resource utilization. We implement this dynamic straggler threshold into the YARN architecture to evaluate its effectiveness against existing state-of-the-art solutions. Results demonstrate that the proposed approach is capable of reducing parallel job response times by up to 20\% compared to the static threshold, as well as a higher speculation success rate, achieving up to 66.67\% against 16.67\% in comparison to the static method.

    @article{e764dbbe13024e4b9f26aac005dfa1bb,
    author = "Ouyang, Xue and Garraghan, Peter and Primas, Bernhard and McKee, David and Townend, Paul and Xu, Jie",
    title = "Adaptive Speculation for Efficient Internetware Application Execution in Clouds",
    abstract = "Modern Cloud computing systems are massive in scale, featuring environments that can execute highly dynamic Internetware applications with huge numbers of interacting tasks. This has led to a substantial challenge−the straggler problem, whereby a small subset of slow tasks significantly impedes parallel job completion. This problem results in longer service responses, degraded system performance, and late timing failures that can easily threaten Quality of Service (QoS) compliance. Speculative execution (or speculation) is the prominent method deployed in Clouds to tolerate stragglers by creating task replicas at runtime. The method detects stragglers by specifying a predefined threshold to calculate the difference between individual tasks and the average task progression within a job. However, such a static threshold debilitates speculation effectiveness as it fails to capture the intrinsic diversity of timing constraints in Internetware applications, as well as dynamic environmental factors such as resource utilization. By considering such characteristics, different levels of strictness for replica creation can be imposed to adaptively achieve specified levels of QoS for different applications. In this paper we present an algorithm to improve the execution efficiency of Internetware applications by dynamically calculating the straggler threshold, considering key parameters including job QoS timing constraints, task execution progress, and optimal system resource utilization. We implement this dynamic straggler threshold into the YARN architecture to evaluate its effectiveness against existing state-of-the-art solutions. Results demonstrate that the proposed approach is capable of reducing parallel job response times by up to 20\% compared to the static threshold, as well as a higher speculation success rate, achieving up to 66.67\% against 16.67\% in comparison to the static method.",
    keywords = "Stragglers, Replicas, QoS, Adaptive Speculation, Execution Efficiency",
    note = "{\textcopyright}ACM, 2018. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Internet Technology (TOIT) http://dx.doi.org/10.1145/3093896",
    year = "2018",
    month = "January",
    doi = "10.1145/3093896",
    language = "English",
    volume = "18",
    journal = "ACM Transactions on Internet Technology",
    issn = "1533-5399",
    publisher = "ASSOC COMPUTING MACHINERY",
    number = "2",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/169788538/Adaptive\_Speculation\_for\_Efficient\_Internetware\_Application\_Execution\_in\_Clouds.pdf"
    }
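
    A toy version of the dynamic straggler threshold described in [30]: the lag used to flag slow tasks is widened or tightened from the job's deadline slack and the current cluster utilization instead of being fixed. The weighting formula, bounds and sample numbers below are invented for illustration and are not the algorithm from the paper.

    def dynamic_threshold(deadline_slack, cluster_util, base=0.2, lo=0.05, hi=0.5):
        # deadline_slack and cluster_util are normalized to [0, 1]; more slack or
        # a busier cluster relaxes the threshold, so fewer replicas get created.
        t = base * (0.5 + deadline_slack) * (0.5 + cluster_util)
        return max(lo, min(hi, t))

    def flag_stragglers(progress, threshold):
        # A task is a straggler if it lags the job's mean progress by more than threshold.
        mean = sum(progress) / len(progress)
        return [i for i, p in enumerate(progress) if mean - p > threshold]

    progress = [0.82, 0.80, 0.79, 0.45, 0.77]
    t = dynamic_threshold(deadline_slack=0.2, cluster_util=0.4)
    print(f"threshold={t:.2f}, stragglers={flag_stragglers(progress, t)}")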


  • [31] X. Li, X. Jiang, P. Garraghan, and Z. Wu, “Holistic energy and failure aware workload scheduling in Cloud datacenters,” Future Generation Computer Systems, vol. 78, iss. 3, p. 887–900, 2018. doi:10.1016/j.future.2017.07.044
    [BibTeX] [Abstract] [pdf][Download PDF]

    The global uptake of Cloud computing has attracted increased interest within both academia and industry resulting in the formation of large-scale and complex distributed systems. This has led to increased failure occurrence within computing systems that induce substantial negative impact upon system performance and task reliability perceived by users. Such systems also consume vast quantities of power, resulting in significant operational costs perceived by providers. Virtualization – a commonly deployed technology within Cloud datacenters – can enable flexible scheduling of virtual machines to maximize system reliability and energy-efficiency. However, existing work address these two objectives separately, providing limited understanding towards studying the explicit trade-offs towards dependable and energy-efficient compute infrastructure. In this paper, we propose two failure-aware energy-efficient scheduling algorithms that exploit the holistic operational characteristics of the Cloud datacenter comprising the cooling unit, computing infrastructure and server failures. By comprehensively modeling the power and failure profiles of a Cloud datacenter, we propose workload scheduling algorithms Ella-W and Ella-B, capable of reducing cooling and compute energy while minimizing the impact of system failures. A novel and overall metric is proposed that combines energy efficiency and reliability to specify the performance of various algorithms. We evaluate our algorithms against Random, MaxUtil, TASA, MTTE and OBFIT under various system conditions of failure prediction accuracy and workload intensity. Evaluation results demonstrate that Ella-W can reduce energy usage by 29.5\% and improve task completion rate by 3.6\%, while Ella-B reduces energy usage by 32.7\% with no degradation to task completion rate.

    @article{11978dccda0b44eb923a762c0130bdba,
    author = "Li, Xiang and Jiang, Xiaohong and Garraghan, Peter and Wu, Zhaohui",
    title = "Holistic energy and failure aware workload scheduling in Cloud datacenters",
    abstract = "The global uptake of Cloud computing has attracted increased interest within both academia and industry resulting in the formation of large-scale and complex distributed systems. This has led to increased failure occurrence within computing systems that induce substantial negative impact upon system performance and task reliability perceived by users. Such systems also consume vast quantities of power, resulting in significant operational costs perceived by providers. Virtualization – a commonly deployed technology within Cloud datacenters – can enable flexible scheduling of virtual machines to maximize system reliability and energy-efficiency. However, existing work address these two objectives separately, providing limited understanding towards studying the explicit trade-offs towards dependable and energy-efficient compute infrastructure. In this paper, we propose two failure-aware energy-efficient scheduling algorithms that exploit the holistic operational characteristics of the Cloud datacenter comprising the cooling unit, computing infrastructure and server failures. By comprehensively modeling the power and failure profiles of a Cloud datacenter, we propose workload scheduling algorithms Ella-W and Ella-B, capable of reducing cooling and compute energy while minimizing the impact of system failures. A novel and overall metric is proposed that combines energy efficiency and reliability to specify the performance of various algorithms. We evaluate our algorithms against Random, MaxUtil, TASA, MTTE and OBFIT under various system conditions of failure prediction accuracy and workload intensity. Evaluation results demonstrate that Ella-W can reduce energy usage by 29.5\% and improve task completion rate by 3.6\%, while Ella-B reduces energy usage by 32.7\% with no degradation to task completion rate.",
    keywords = "Energy efficiency, Thermal management, Reliability, Failures, Workload scheduling, Cloud computing",
    note = "This is the author{\textquoteright}s version of a work that was accepted for publication in Future Generation Computer Systems. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Future Generation Computer Systems, 78, 3, 2017 DOI: 10.1016/j.future.2017.07.044",
    year = "2018",
    month = "January",
    doi = "10.1016/j.future.2017.07.044",
    language = "English",
    volume = "78",
    pages = "887--900",
    journal = "Future Generation Computer Systems",
    issn = "0167-739X",
    publisher = "Elsevier",
    number = "3",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/189009815/FGCS\_Energy\_aware\_Failure\_Aware\_Scheduling\_Accepted.pdf"
    }


2017

  • [32] B. Primas, P. Garraghan, D. McKee, J. Summers, and J. Xu, “A Framework and Task Allocation Analysis for Infrastructure Independent Energy-Efficient Scheduling in Cloud Data Centers,” in 2017 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 2017, p. 178–185. doi:10.1109/CloudCom.2017.26
    [BibTeX] [Abstract] [pdf][Download PDF]

    Cloud computing represents a paradigm shift in provisioning on-demand computational resources underpinned by data center infrastructure, which now constitutes 1.5\% of worldwide energy consumption. Such consumption is not merely limited to operating IT devices, but encompasses cooling systems representing 40\% total data center energy usage. Given the substantive complexity and heterogeneity of data center operation spanning both computing and cooling components, obtaining analytical models for optimizing data center energy-efficiency is an inherently difficult challenge. Specifically, difficulties arise pertaining to the non-intuitive relationship between computing and cooling energy in the data center, computationally complex energy modeling, as well as cooling models restricted to a specific class of data center facility geometry – all of which arise from the interdisciplinary nature of this research domain. In this paper we propose a framework for energy-efficient scheduling to alleviate these challenges. It is applicable to any type of data center infrastructure and does not require complex modeling of energy. Instead, the concept of a target workload distribution is proposed. If the workload is assigned to nodes according to the target workload distribution, then the energy consumption is minimized. The exact target workload distribution is unknown, but an approximated distribution is delivered by the framework. The scheduling objective is to assign workload to nodes such that the workload distribution becomes as similar as possible to the target distribution in order to reduce energy consumption. Several mathematically sound algorithms have been designed to address this novel type of scheduling problem. Simulation results demonstrate that our algorithms reduce the relative deviation by at least 16.9\% and the relative variance by at least 22.67\% in comparison to (asymmetric) load balancing algorithms.

    @inproceedings{d39bd64978c748d39450ff5f9822e3f4,
    author = "Primas, Bernhard and Garraghan, Peter and McKee, David and Summers, Jon and Xu, Jie",
    title = "A Framework and Task Allocation Analysis for Infrastructure Independent Energy-Efficient Scheduling in Cloud Data Centers",
    abstract = "Cloud computing represents a paradigm shift in provisioning on-demand computational resources underpinned by data center infrastructure, which now constitutes 1.5\% of worldwide energy consumption. Such consumption is not merely limited to operating IT devices, but encompasses cooling systems representing 40\% total data center energy usage. Given the substantive complexity and heterogeneity of data center operation spanning both computing and cooling components, obtaining analytical models for optimizing data center energy-efficiency is an inherently difficult challenge. Specifically, difficulties arise pertaining to the non-intuitive relationship between computing and cooling energy in the data center, computationally complex energy modeling, as well as cooling models restricted to a specific class of data center facility geometry - all of which arise from the interdisciplinary nature of this research domain. In this paper we propose a framework for energy-efficient scheduling to alleviate these challenges. It is applicable to any type of data center infrastructure and does not require complex modeling of energy. Instead, the concept of a target workload distribution is proposed. If the workload is assigned to nodes according to the target workload distribution, then the energy consumption is minimized. The exact target workload distribution is unknown, but an approximated distribution is delivered by the framework. The scheduling objective is to assign workload to nodes such that the workload distribution becomes as similar as possible to the target distribution in order to reduce energy consumption. Several mathematically sound algorithms have been designed to address this novel type of scheduling problem. Simulation results demonstrate that our algorithms reduce the relative deviation by at least 16.9\% and the relative variance by at least 22.67\% in comparison to (asymmetric) load balancing algorithms.",
    keywords = "Cloud computing, Energy efficiency, Thermal-aware scheduling, Combinatorial optimization",
    note = "{\textcopyright}2017 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.",
    year = "2017",
    month = "December",
    day = "11",
    doi = "10.1109/CloudCom.2017.26",
    language = "English",
    isbn = "9781538606933",
    pages = "178--185",
    booktitle = "2017 IEEE International Conference on Cloud Computing Technology and Science (CloudCom)",
    publisher = "IEEE",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/205409226/Infrastructure\_Independent\_Energy\_Efficient\_Scheduling\_in\_Cloud.pdf"
    }
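
    The target-workload-distribution objective in [32] can be made concrete with a greedy stand-in: route each task to the node whose current share of the assigned work lags its target share the most. This is only an illustration of the scheduling objective, not one of the paper's algorithms, and the target fractions and task sizes are invented.

    def assign(tasks, target):
        # tasks: workload sizes; target: node -> desired fraction of total work.
        load = {n: 0.0 for n in target}
        for t in sorted(tasks, reverse=True):              # place big tasks first
            total = sum(load.values()) + t
            # Pick the node that would remain furthest below its target share.
            node = max(target, key=lambda n: target[n] - load[n] / total)
            load[node] += t
        return load

    target = {"a": 0.5, "b": 0.3, "c": 0.2}
    load = assign([4, 3, 3, 2, 2, 1, 1], target)
    total = sum(load.values())
    print({n: round(load[n] / total, 2) for n in load})    # ends up close to the target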


  • [33] R. Yang, Y. Zhang, P. Garraghan, Y. Feng, J. Ouyang, J. Xu, Z. Zhang, and C. Li, “Reliable computing service in massive-scale systems through rapid low-cost failover,” IEEE Transactions on Services Computing, vol. 10, iss. 6, p. 969–983, 2017. doi:10.1109/TSC.2016.2544313
    [BibTeX] [Abstract] [pdf][Download PDF]

    Large-scale distributed systems in Cloud datacenters are capable of provisioning service to consumers with diverse business requirements. Providers face pressure to provision uninterrupted reliable services while reducing operational costs due to significant software and hardware failures. A widely used means to achieve such a goal is using redundant system components to implement user-transparent failover, yet its effectiveness must be balanced carefully without incurring heavy overhead when deployed – an important practical consideration for complex large-scale systems. Failover techniques developed for Cloud systems often suffer serious limitations, including mandatory restart leading to poor cost-effectiveness, as well as solely focusing on crash failures, omitting other important types, e.g. timing failures and simultaneous failures. This paper addresses these limitations by presenting a new approach to user-transparent failover for massive-scale systems. The approach uses soft-state inference to achieve rapid failure recovery and avoid unnecessary restart, with minimal system resource overhead. It also copes with different failures, including correlated and simultaneous events. The proposed approach was implemented, deployed and evaluated within Fuxi system, the underlying resource management system used within Alibaba Cloud. Results demonstrate that our approach tolerates complex failure scenarios while incurring at worst 228.5 microsecond instance overhead with 1.71\% additional CPU usage.

    @article{9ac0e732a6004028a57aa9c058715a77,
    author = "Yang, Renyu and Zhang, Yang and Garraghan, Peter and Feng, Yihui and Ouyang, Jin and Xu, Jie and Zhang, Zhuo and Li, Chao",
    title = "Reliable computing service in massive-scale systems through rapid low-cost failover",
    abstract = "Large-scale distributed systems in Cloud datacenters are capable of provisioning service to consumers with diverse business requirements. Providers face pressure to provision uninterrupted reliable services while reducing operational costs due to significant software and hardware failures. A widely used means to achieve such a goal is using redundant system components to implement user-transparent failover, yet its effectiveness must be balanced carefully without incurring heavy overhead when deployed – an important practical consideration for complex large-scale systems. Failover techniques developed for Cloud systems often suffer serious limitations, including mandatory restart leading to poor cost-effectiveness, as well as solely focusing on crash failures, omitting other important types, e.g. timing failures and simultaneous failures. This paper addresses these limitations by presenting a new approach to user-transparent failover for massive-scale systems. The approach uses soft-state inference to achieve rapid failure recovery and avoid unnecessary restart, with minimal system resource overhead. It also copes with different failures, including correlated and simultaneous events. The proposed approach was implemented, deployed and evaluated within Fuxi system, the underlying resource management system used within Alibaba Cloud. Results demonstrate that our approach tolerates complex failure scenarios while incurring at worst 228.5 microsecond instance overhead with 1.71\% additional CPU usage.",
    keywords = "Failover, Cloud computing, Resource management, Reliability, Services",
    note = "{\textcopyright} 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/ republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.",
    year = "2017",
    month = "November",
    doi = "10.1109/TSC.2016.2544313",
    language = "English",
    volume = "10",
    pages = "969--983",
    journal = "IEEE Transactions on Services Computing",
    issn = "1939-1374",
    publisher = "Institute of Electrical and Electronics Engineers",
    number = "6",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/134463487/tsc\_2016\_camera\_ready\_v10.pdf"
    }


  • [34] Z. Wen, R. Yang, P. Garraghan, T. Lin, J. Xu, and M. Rovatsos, “Fog Orchestration for Internet of Things Services,” IEEE Internet Computing, vol. 21, iss. 2, p. 16–24, 2017. doi:10.1109/MIC.2017.36
    [BibTeX] [Abstract] [pdf][Download PDF]

    Large-scale Internet of Things (IoT) services such as healthcare, smart cities, and marine monitoring are pervasive in cyber-physical environments strongly supported by Internet technologies and fog computing. Complex IoT services are increasingly composed of sensors, devices, and compute resources within fog computing infrastructures. The orchestration of such applications can be leveraged to alleviate the difficulties of maintenance and enhance data security and system reliability. However, efficiently dealing with dynamic variations and transient operational behavior is a crucial challenge within the context of choreographing complex services. Furthermore, with the rapid increase of the scale of IoT deployments, the heterogeneity, dynamicity, and uncertainty within fog environments and increased computational complexity further aggravate this challenge. This article gives an overview of the core issues, challenges, and future research directions in fog-enabled orchestration for IoT services. Additionally, it presents early experiences of an orchestration scenario, demonstrating the feasibility and initial results of using a distributed genetic algorithm in this context.

    @article{f4d50942308b439ebfddc4e79b4fc9f0,
    author = "Wen, Zhenyu and Yang, Renyu and Garraghan, Peter and Lin, Tao and Xu, Jie and Rovatsos, Michael",
    title = "Fog Orchestration for Internet of Things Services",
    abstract = "Large-scale Internet of Things (IoT) services such as healthcare, smart cities, and marine monitoring are pervasive in cyber-physical environments strongly supported by Internet technologies and fog computing. Complex IoT services are increasingly composed of sensors, devices, and compute resources within fog computing infrastructures. The orchestration of such applications can be leveraged to alleviate the difficulties of maintenance and enhance data security and system reliability. However, efficiently dealing with dynamic variations and transient operational behavior is a crucial challenge within the context of choreographing complex services. Furthermore, with the rapid increase of the scale of IoT deployments, the heterogeneity, dynamicity, and uncertainty within fog environments and increased computational complexity further aggravate this challenge. This article gives an overview of the core issues, challenges, and future research directions in fog-enabled orchestration for IoT services. Additionally, it presents early experiences of an orchestration scenario, demonstrating the feasibility and initial results of using a distributed genetic algorithm in this context.",
    keywords = "Internet of Things, Fog Computing, Service Orchestration",
    note = "{\textcopyright}2017 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.",
    year = "2017",
    month = "March",
    day = "1",
    doi = "10.1109/MIC.2017.36",
    language = "English",
    volume = "21",
    pages = "16--24",
    journal = "IEEE Internet Computing",
    issn = "1089-7801",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",
    number = "2",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/189010548/Fog\_Orchestration\_for\_IoT\_Services.pdf"
    }


2016

  • [35] P. Garraghan, Y. Al-Anii, J. Summers, H. Thompson, N. Kapur, and K. Djemame, “A unified model for holistic power usage in cloud datacenter servers,” in UCC ’16 Proceedings of the 9th International Conference on Utility and Cloud Computing, 2016, p. 11–19. doi:10.1145/2996890.2996896
    [BibTeX] [Abstract] [pdf][Download PDF]

    Cloud datacenters are compute facilities formed by hundreds and thousands of heterogeneous servers requiring significant power requirements to operate effectively. Servers are composed by multiple interacting sub-systems including applications, microelectronic processors, and cooling which reflect their respective power profiles via different parameters. What is presently unknown is how to accurately model the holistic power usage of the entire server when including all these sub-systems together. This becomes increasingly challenging when considering diverse utilization patterns, server hardware characteristics, air and liquid cooling techniques, and importantly quantifying the non-electrical energy cost imposed by cooling operation. Such a challenge arises due to the need for multi-disciplinary expertise required to study server operation holistically. This work provides a unified model for capturing holistic power usage within Cloud datacenter servers. Constructed through controlled laboratory experiments, the model captures the relationship of server power usage between software, hardware, and cooling agnostic of architecture and cooling type (air and liquid). An exciting prospect is the ability to quantify the amount of non-electrical power consumed through cooling, allowing for more realistic and accurate server power profiles. This work represents the first empirically supported analysis and modeling of holistic power usage for Cloud datacenter servers, and bridges a significant gap between computer science and mechanical engineering research. Model validation through experiments demonstrates an average standard error of 3\% for server power usage within both air and liquid cooled environments.

    @inproceedings{3ef44600c890497984a45e156b6745c1,
    author = "Garraghan, Peter and Al-Anii, Yaser and Summers, Jon and Thompson, Harvey and Kapur, Nik and Djemame, Karim",
    title = "A unified model for holistic power usage in cloud datacenter servers",
    abstract = "Cloud datacenters are compute facilities formed by hundreds and thousands of heterogeneous servers requiring significant power requirements to operate effectively. Servers are composed by multiple interacting sub-systems including applications, microelectronic processors, and cooling which reflect their respective power profiles via different parameters. What is presently unknown is how to accurately model the holistic power usage of the entire server when including all these sub-systems together. This becomes increasingly challenging when considering diverse utilization patterns, server hardware characteristics, air and liquid cooling techniques, and importantly quantifying the non-electrical energy cost imposed by cooling operation. Such a challenge arises due to the need for multi-disciplinary expertise required to study server operation holistically. This work provides a unified model for capturing holistic power usage within Cloud datacenter servers. Constructed through controlled laboratory experiments, the model captures the relationship of server power usage between software, hardware, and cooling agnostic of architecture and cooling type (air and liquid). An exciting prospect is the ability to quantify the amount of non-electrical power consumed through cooling, allowing for more realistic and accurate server power profiles. This work represents the first empirically supported analysis and modeling of holistic power usage for Cloud datacenter servers, and bridges a significant gap between computer science and mechanical engineering research. Model validation through experiments demonstrates an average standard error of 3\% for server power usage within both air and liquid cooled environments.",
    keywords = "Cloud Computing, Power and Energy, Datacenter",
    note = "{\textcopyright}ACM, 2016. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in UCC '16 Proceedings of the 9th International Conference on Utility and Cloud Computing http://dx.doi.org/10.1145/2996890.2996896",
    year = "2016",
    month = "December",
    day = "6",
    doi = "10.1145/2996890.2996896",
    language = "English",
    isbn = "9781450346160",
    pages = "11--19",
    booktitle = "UCC '16 Proceedings of the 9th International Conference on Utility and Cloud Computing",
    publisher = "ACM",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/135475519/UCC\_33\_Camera\_Copy\_Ready\_Version.pdf"
    }
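
    To make the notion of holistic server power in [35] concrete, the sketch below adds a cooling overhead on top of an IT power estimate and contrasts air against liquid cooling. Every coefficient here is an invented placeholder; the paper's unified model is fitted from controlled laboratory measurements rather than assumed.

    def it_power(util, p_idle=95.0, p_dynamic=140.0):
        # IT (compute) draw from CPU utilization; placeholder coefficients.
        return p_idle + p_dynamic * util

    def cooling_overhead(it_watts, mode="air"):
        # Assumed ancillary-power factors: liquid cooling is taken to move the
        # same heat with less fan/pump power than air cooling.
        factor = {"air": 0.30, "liquid": 0.10}[mode]
        return factor * it_watts

    def holistic_power(util, mode="air"):
        it = it_power(util)
        return it + cooling_overhead(it, mode)

    for mode in ("air", "liquid"):
        print(mode, round(holistic_power(0.6, mode), 1), "W")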


  • [36] B. Primas, P. Garraghan, K. Djemame, and N. Shakhlevich, “Resource boxing: converting realistic cloud task utilization patterns for theoretical scheduling,” in UCC ’16 Proceedings of the 9th International Conference on Utility and Cloud Computing, 2016, p. 138–147. doi:10.1145/2996890.2996897
    [BibTeX] [Abstract] [pdf][Download PDF]

    Scheduling is a core component within distributed systems to determine optimal allocation of tasks within servers. This is challenging within modern Cloud computing systems – comprising millions of tasks executing in thousands of heterogeneous servers. Theoretical scheduling is capable of providing complete and sophisticated algorithms towards a single objective function. However, Cloud computing systems pursue multiple and oftentimes conflicting objectives towards provisioning high levels of performance, availability, reliability and energy-efficiency. As a result, theoretical scheduling for Cloud computing is performed by simplifying assumptions for applicability. This is especially true for task utilization patterns, which fluctuate in practice yet are modelled as piecewise constant in theoretical scheduling models. While there exists work for modelling dynamic Cloud task patterns for evaluating applied scheduling, such models are incompatible with the inputs needed for theoretical scheduling – which require such patterns to be represented as boxes. Presently there exist no methods capable of accurately converting real task patterns derived from empirical data into boxes. This results in a significant gap towards theoreticians understanding and proposing algorithms derived from realistic assumptions towards enhanced Cloud scheduling. This work proposes resource boxing – an approach for automated conversion of realistic task patterns in Cloud computing directly into box inputs for theoretical scheduling. We propose numerous resource conversion algorithms capable of accurately representing real task utilization patterns in the form of scheduling boxes. Algorithms were evaluated using production Cloud trace data, demonstrating a difference between real utilization and scheduling boxes less than 5\%. We also provide an application for how resource boxing can be exploited to directly translate research from the applied community into the theoretical community.

    @inproceedings{a85bbf3d064348aaabe065f69bf5c773,
    author = "Primas, Bernhard and Garraghan, Peter and Djemame, Karim and Shakhlevich, Natasha",
    title = "Resource boxing: converting realistic cloud task utilization patterns for theoretical scheduling",
    abstract = "Scheduling is a core component within distributed systems to determine optimal allocation of tasks within servers. This is challenging within modern Cloud computing systems – comprising millions of tasks executing in thousands of heterogeneous servers. Theoretical scheduling is capable of providing complete and sophisticated algorithms towards a single objective function. However, Cloud computing systems pursue multiple and oftentimes conflicting objectives towards provisioning high levels of performance, availability, reliability and energy-efficiency. As a result, theoretical scheduling for Cloud computing is performed by simplifying assumptions for applicability. This is especially true for task utilization patterns, which fluctuate in practice yet are modelled as piecewise constant in theoretical scheduling models. While there exists work for modelling dynamic Cloud task patterns for evaluating applied scheduling, such models are incompatible with the inputs needed for theoretical scheduling – which require such patterns to be represented as boxes. Presently there exist no methods capable of accurately converting real task patterns derived from empirical data into boxes. This results in a significant gap towards theoreticians understanding and proposing algorithms derived from realistic assumptions towards enhanced Cloud scheduling. This work proposes resource boxing – an approach for automated conversion of realistic task patterns in Cloud computing directly into box inputs for theoretical scheduling. We propose numerous resource conversion algorithms capable of accurately representing real task utilization patterns in the form of scheduling boxes. Algorithms were evaluated using production Cloud trace data, demonstrating a difference between real utilization and scheduling boxes less than 5\%. We also provide an application for how resource boxing can be exploited to directly translate research from the applied community into the theoretical community.",
    note = "{\textcopyright}ACM, 2016. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in UCC '16 Proceedings of the 9th International Conference on Utility and Cloud Computing http://dx.doi.org/10.1145/2996890.2996897",
    year = "2016",
    month = "December",
    day = "6",
    doi = "10.1145/2996890.2996897",
    language = "English",
    isbn = "9781450346160",
    pages = "138--147",
    booktitle = "UCC '16 Proceedings of the 9th International Conference on Utility and Cloud Computing",
    publisher = "ACM",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/136310743/Resource\_Boxing\_Camera\_Copy.pdf"
    }
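
    Resource boxing, as described in [36], reduces a fluctuating utilization trace to a fixed-height box that theoretical scheduling models can consume. The two conversion rules below (peak and quantile boxing) are obvious baselines written only for illustration; the paper proposes and evaluates its own set of conversion algorithms against production trace data.

    def box(trace, how="peak", q=0.8):
        # Collapse a per-interval utilization trace into a (duration, height) box.
        if how == "peak":
            height = max(trace)
        else:
            s = sorted(trace)
            height = s[min(len(s) - 1, int(q * len(s)))]   # quantile boxing
        return len(trace), height

    def boxing_error(trace, height):
        # Relative gap between the box area and the real utilization area.
        real = sum(trace)
        return abs(height * len(trace) - real) / real

    trace = [0.2, 0.3, 0.8, 0.4, 0.35, 0.3, 0.9, 0.25]
    for how in ("peak", "quantile"):
        duration, height = box(trace, how)
        print(how, (duration, height), "error:", round(boxing_error(trace, height), 2))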


  • [37] X. Ouyang, P. Garraghan, C. Wang, P. Townend, and J. Xu, “An approach for modeling and ranking node-level stragglers in cloud datacenters,” in 2016 IEEE International Conference on Services Computing (SCC), 2016, p. 673–680. doi:10.1109/SCC.2016.93
    [BibTeX] [Abstract] [pdf][Download PDF]

    The ability of servers to effectively execute tasks within Cloud datacenters varies due to heterogeneous CPU and memory capacities, resource contention situations, network configurations and operational age. Unexpectedly slow server nodes (node-level stragglers) result in assigned tasks becoming task-level stragglers, which dramatically impede parallel job execution. However, it is currently unknown how slow nodes directly correlate to task straggler manifestation. To address this knowledge gap, we propose a method for node performance modeling and ranking in Cloud datacenters based on analyzing parallel job execution tracelog data. By using a production Cloud system as a case study, we demonstrate how node execution performance is driven by temporal changes in node operation as opposed to node hardware capacity. Different sample sets have been filtered in order to evaluate the generality of our framework, and the analytic results demonstrate that node abilities of executing parallel tasks tend to follow a 3-parameter-loglogistic distribution. Further statistical attribute values such as confidence interval, quantile value, extreme case possibility, etc. can also be used for ranking and identifying potential straggler nodes within the cluster. We exploit a graph-based algorithm for partitioning server nodes into five levels, with 0.83\% of node-level stragglers identified. Our work lays the foundation towards enhancing scheduling algorithms by avoiding slow nodes, reducing task straggler occurrence, and improving parallel job performance.

    @inproceedings{98c33c2c0dca47f8959e3bc7eb2eb573,
    author = "Ouyang, Xue and Garraghan, Peter and Wang, Changjian and Townend, Paul and Xu, Jie",
    title = "An approach for modeling and ranking node-level stragglers in cloud datacenters",
    abstract = "The ability of servers to effectively execute tasks within Cloud datacenters varies due to heterogeneous CPU and memory capacities, resource contention situations, network configurations and operational age. Unexpectedly slow server nodes (node-level stragglers) result in assigned tasks becoming task-level stragglers, which dramatically impede parallel job execution. However, it is currently unknown how slow nodes directly correlate to task straggler manifestation. To address this knowledge gap, we propose a method for node performance modeling and ranking in Cloud datacenters based on analyzing parallel job execution tracelog data. By using a production Cloud system as a case study, we demonstrate how node execution performance is driven by temporal changes in node operation as opposed to node hardware capacity. Different sample sets have been filtered in order to evaluate the generality of our framework, and the analytic results demonstrate that node abilities of executing parallel tasks tend to follow a 3-parameter-loglogistic distribution. Further statistical attribute values such as confidence interval, quantile value, extreme case possibility, etc. can also be used for ranking and identifying potential straggler nodes within the cluster. We exploit a graph-based algorithm for partitioning server nodes into five levels, with 0.83\% of node-level stragglers identified. Our work lays the foundation towards enhancing scheduling algorithms by avoiding slow nodes, reducing task straggler occurrence, and improving parallel job performance.",
    keywords = "Servers, Production, Data models, Computational modeling, Analytical models, Time factors, Calculators",
    note = "{\textcopyright} 2016, IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/ republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.",
    year = "2016",
    month = "September",
    day = "1",
    doi = "10.1109/SCC.2016.93",
    language = "English",
    pages = "673--680",
    booktitle = "2016 IEEE International Conference on Services Computing (SCC)",
    publisher = "IEEE",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/134464104/An\_Approach\_for\_Modeling\_and\_Ranking\_Node\_level\_Stragglers\_in\_Cloud\_Datacenters.pdf"
    }
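
    The node-ranking step in [37] fits execution statistics to a 3-parameter log-logistic distribution and inspects its tail. The snippet below performs an analogous fit with SciPy's Fisk (log-logistic) distribution on synthetic data; the sample values and the 99th-percentile cut-off are arbitrary choices for illustration, not the paper's ranking procedure.

    import numpy as np
    from scipy.stats import fisk

    # Synthetic per-node average task durations (seconds), log-logistic by construction.
    durations = fisk.rvs(c=3.0, loc=5.0, scale=20.0, size=500, random_state=0)

    c, loc, scale = fisk.fit(durations)               # 3-parameter fit (shape, loc, scale)
    cutoff = fisk.ppf(0.99, c, loc=loc, scale=scale)  # 99th-percentile duration
    slow = np.where(durations > cutoff)[0]
    print(f"fit: c={c:.2f}, loc={loc:.2f}, scale={scale:.2f}; "
          f"{len(slow)} candidate straggler nodes out of {len(durations)}")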


  • [38] X. Ouyang, P. Garraghan, R. Yang, P. Townend, and J. Xu, “Reducing late-timing failure at scale: straggler root-cause analysis in cloud datacenters.” 2016.
    [BibTeX] [Abstract] [pdf][Download PDF]

    Task stragglers hinder effective parallel job execution in Cloud datacenters, resulting in late-timing failures due to the violation of specified timing constraints. Straggler-tolerant methods such as speculative execution provide limited effectiveness due to (i) lack of precise straggler root-cause knowledge and (ii) straggler identification occurring too late within a job lifecycle. This paper proposes a method to ascertain underlying straggler root-causes by analyzing key parameters within large-scale distributed systems, and to determine the correlation between straggler occurrence and factors including resource contention, task concurrency, and server failures. Our preliminary study of a production Cloud datacenter indicates that the dominant straggler root-cause is high temporal resource contention. The result can assist in enhancing straggler prediction and mitigation for tolerating late-timing failures within large-scale distributed systems.

    @conference{dcff2961b690466ead97a7c2b0b07eef,
    author = "Ouyang, Xue and Garraghan, Peter and Yang, Renyu and Townend, Paul and Xu, Jie",
    title = "Reducing late-timing failure at scale: straggler root-cause analysis in cloud datacenters",
    abstract = "Task stragglers hinder effective parallel job execution in Cloud datacenters, resulting in late-timing failures due to the violation of specified timing constraints. Straggler-tolerant methods such as speculative execution provide limited effectiveness due to (i) lack of precise straggler root-cause knowledge and (ii) straggler identification occurring too late within a job lifecycle. This paper proposes a method to ascertain underlying straggler root-causes by analyzing key parameters within large-scale distributed systems, and to determine the correlation between straggler occurrence and factors including resource contention, task concurrency, and server failures. Our preliminary study of a production Cloud datacenter indicates that the dominant straggler root-cause is high temporal resource contention. The result can assist in enhancing straggler prediction and mitigation for tolerating late-timing failures within large-scale distributed systems.",
    year = "2016",
    month = "August",
    day = "18",
    language = "English",
    note = "46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2016 ; Conference date: 28-06-2016 Through 01-07-2016",
    url = "https://dsn-2016.sciencesconf.org/",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/134464689/Reducing\_Late\_Timing\_Failure.pdf"
    }


  • [39] Z. Wu, X. Li, P. Garraghan, X. Jiang, K. Ye, and A. Y. Zomaya, “Virtual machine level temperature profiling and prediction in cloud datacenters,” in 2016 IEEE 36th International Conference on Distributed Computing Systems (ICDCS), 2016, p. 735–736. doi:10.1109/ICDCS.2016.62
    [BibTeX] [Abstract] [pdf][Download PDF]

    Temperature prediction can enhance datacenter thermal management towards minimizing cooling power draw. Traditional approaches achieve this through analyzing task-temperature profiles or resistor-capacitor circuit models to predict CPU temperature. However, they are unable to capture task resource heterogeneity within multi-tenant environments and make predictions under dynamic scenarios such as virtual machine migration, which is one of the main characteristics of Cloud computing. This paper proposes virtual machine level temperature prediction in Cloud datacenters. Experiments show that the mean squared error of stable CPU temperature prediction is within 1.10, and dynamic CPU temperature prediction can achieve 1.60 in most scenarios.

    @inproceedings{b369ab2855764d7b8ef749fb7a5582fa,
    author = "Wu, Zhaohui and Li, Xiang and Garraghan, Peter and Jiang, Xiaohong and Ye, Kejiang and Zomaya, {Albert Y.}",
    title = "Virtual machine level temperature profiling and prediction in cloud datacenters",
    abstract = "Temperature prediction can enhance datacenter thermal management towards minimizing cooling power draw. Traditional approaches achieve this through analyzing task-temperature profiles or resistor-capacitor circuit models to predict CPU temperature. However, they are unable to capture task resource heterogeneity within multi-tenant environments and make predictions under dynamic scenarios such as virtual machine migration, which is one of the main characteristics of Cloud computing. This paper proposes virtual machine level temperature prediction in Cloud datacenters. Experiments show that the mean squared error of stable CPU temperature prediction is within 1.10, and dynamic CPU temperature prediction can achieve 1.60 in most scenarios.",
    keywords = "Temperature, Mathematical model, Predictive models, Calibration, Servers, Cloud computing, Temperature measurement",
    note = "{\textcopyright} 2016 IEEE. . Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.",
    year = "2016",
    month = "August",
    day = "11",
    doi = "10.1109/ICDCS.2016.62",
    language = "English",
    pages = "735--736",
    booktitle = "2016 IEEE 36th International Conference on Distributed Computing Systems (ICDCS)",
    publisher = "IEEE",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/134464868/Virtual\_Machine\_Level\_Temperature\_Profiling.pdf"
    }


  • [40] P. Garraghan, S. Perks, X. Ouyang, D. McKee, and I. S. Moreno, “Tolerating transient late-timing faults in cloud-based real-time stream processing,” in 2016 IEEE 19th International Symposium on Real-Time Distributed Computing (ISORC), 2016, p. 108–115. doi:10.1109/ISORC.2016.24
    [BibTeX] [Abstract] [pdf][Download PDF]

    Real-time stream processing is a frequently deployed application within Cloud datacenters that is required to provision high levels of performance and reliability. Numerous fault-tolerant approaches have been proposed to effectively achieve this objective in the presence of crash failures. However, such systems struggle with transient late-timing faults – a fault classification challenging to effectively tolerate – that manifests increasingly within large-scale distributed systems. Such faults represent a significant threat towards minimizing soft real-time execution of streaming applications in the presence of failures. This work proposes a fault-tolerant approach for QoS-aware data prediction to tolerate transient late-timing faults. The approach is capable of determining the most effective data prediction algorithm for imposed QoS constraints on a failed stream processor at run-time. We integrated our approach into Apache Storm with experiment results showing its ability to minimize stream processor end-to-end execution time by 61\% compared to other fault-tolerant approaches. The approach incurs 12\% additional CPU utilization while reducing network usage by 44\%.

    @inproceedings{6913d10185204e06b07383686fbb2f7d,
    author = "Garraghan, Peter and Perks, Stuart and Ouyang, Xue and McKee, David and Moreno, {Ismael Solis}",
    title = "Tolerating transient late-timing faults in cloud-based real-time stream processing",
    abstract = "Real-time stream processing is a frequently deployed application within Cloud datacenters that is required to provision high levels of performance and reliability. Numerous fault-tolerant approaches have been proposed to effectively achieve this objective in the presence of crash failures. However, such systems struggle with transient late-timing faults - a fault classification challenging to effectively tolerate - that manifests increasingly within large-scale distributed systems. Such faults represent a significant threat towards minimizing soft real-time execution of streaming applications in the presence of failures. This work proposes a fault-tolerant approach for QoS-aware data prediction to tolerate transient late-timing faults. The approach is capable of determining the most effective data prediction algorithm for imposed QoS constraints on a failed stream processor at run-time. We integrated our approach into Apache Storm with experiment results showing its ability to minimize stream processor end-to-end execution time by 61\% compared to other fault-tolerant approaches. The approach incurs 12\% additional CPU utilization while reducing network usage by 44\%.",
    keywords = "Prediction algorithms, Real-time systems, Fault tolerance, Fault tolerant systems, Transient analysis, Quality of service, Predictive models",
    note = "{\textcopyright} 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/ republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.",
    year = "2016",
    month = "July",
    day = "21",
    doi = "10.1109/ISORC.2016.24",
    language = "English",
    pages = "108--115",
    booktitle = "2016 IEEE 19th International Symposium on Real-Time Distributed Computing (ISORC)",
    publisher = "IEEE",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/134464604/Submitted\_ISORC\_Real\_time\_Stream\_Processing.pdf"
    }
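
    The run-time choice at the heart of [40], selecting the data-prediction algorithm that best fits the QoS budget left on a failed stream processor, can be sketched as a feasibility filter followed by an accuracy maximization. The predictor names, latency estimates and accuracy figures below are all made up for the example and do not come from the paper.

    PREDICTORS = [
        # (name, estimated latency in ms, historical accuracy) -- illustrative values
        ("last_value",        1.0, 0.70),
        ("moving_average",    3.0, 0.80),
        ("linear_regression", 8.0, 0.88),
        ("arima",            40.0, 0.93),
    ]

    def pick_predictor(budget_ms):
        # Of the predictors expected to finish inside the remaining QoS budget,
        # choose the most accurate; fall back to the cheapest if none fit.
        feasible = [p for p in PREDICTORS if p[1] <= budget_ms]
        return max(feasible, key=lambda p: p[2], default=PREDICTORS[0])

    for budget in (2, 10, 100):
        print(f"{budget} ms budget -> {pick_predictor(budget)[0]}")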


  • [41] X. Ouyang, P. Garraghan, D. McKee, P. Townend, and J. Xu, “Straggler detection in parallel computing systems through dynamic threshold calculation,” in 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA), 2016, p. 414–421. doi:10.1109/AINA.2016.84
    [BibTeX] [Abstract] [pdf][Download PDF]

    Cloud computing systems face the substantial challenge of the Long Tail problem: a small subset of straggling tasks significantly impedes parallel job completion. This behavior results in longer service response times and degraded system utilization. Speculative execution, which creates task replicas at runtime, is a typical method deployed in large-scale distributed systems to tolerate stragglers. This approach defines stragglers by specifying a static threshold value, which calculates the temporal difference between an individual task and the average task progression for a job. However, specifying a static threshold debilitates speculation effectiveness as it fails to consider the intrinsic diversity of job timing constraints within modern day Cloud computing systems. Capturing such heterogeneity enables the ability to impose different levels of strictness for replica creation while achieving specified levels of QoS for different application types. Furthermore, a static threshold also fails to consider system environmental constraints in terms of replication overheads and optimal system resource usage. In this paper we present an algorithm for dynamically calculating a threshold value to identify task stragglers, considering key parameters including job QoS timing constraints, task execution characteristics, and optimal system resource utilization. We study and demonstrate the effectiveness of our algorithm through simulating a number of different operational scenarios based on real production cluster data against state-of-the-art solutions. Results demonstrate that our approach is capable of creating 58.62\% fewer replicas under high resource utilization while reducing response time up to 17.86\% for idle periods compared to a static threshold.

    @inproceedings{fec2284538f54f5888063a444f630876,
    author = "Ouyang, Xue and Garraghan, Peter and McKee, David and Townend, Paul and Xu, Jie",
    title = "Straggler detection in parallel computing systems through dynamic threshold calculation",
    abstract = "Cloud computing systems face the substantial challenge of the Long Tail problem: a small subset of straggling tasks significantly impedes parallel job completion. This behavior results in longer service response times and degraded system utilization. Speculative execution, which creates task replicas at runtime, is a typical method deployed in large-scale distributed systems to tolerate stragglers. This approach defines stragglers by specifying a static threshold value, which calculates the temporal difference between an individual task and the average task progression for a job. However, specifying a static threshold debilitates speculation effectiveness as it fails to consider the intrinsic diversity of job timing constraints within modern day Cloud computing systems. Capturing such heterogeneity enables the ability to impose different levels of strictness for replica creation while achieving specified levels of QoS for different application types. Furthermore, a static threshold also fails to consider system environmental constraints in terms of replication overheads and optimal system resource usage. In this paper we present an algorithm for dynamically calculating a threshold value to identify task stragglers, considering key parameters including job QoS timing constraints, task execution characteristics, and optimal system resource utilization. We study and demonstrate the effectiveness of our algorithm through simulating a number of different operational scenarios based on real production cluster data against state-of-the-art solutions. Results demonstrate that our approach is capable of creating 58.62\% fewer replicas under high resource utilization while reducing response time up to 17.86\% for idle periods compared to a static threshold.",
    keywords = "Quality of service, Timing, Heuristic algorithms, Cloud computing, Time factors, Resource management, Production",
    note = "(c) 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/ republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.",
    year = "2016",
    month = "May",
    day = "23",
    doi = "10.1109/AINA.2016.84",
    language = "English",
    pages = "414--421",
    booktitle = "2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA)",
    publisher = "IEEE",
    pdf = "https://www.research.lancs.ac.uk/portal/services/downloadRegister/134464959/PID4055511\_final.pdf"
    }