XTune: Reliable and eXplainable Data Systems Tuning via Deep Learning
Professor: Jarek Szlichta
Contact Info: Email: szlichta@yorku.ca
Position Type: Lassonde Undergraduate Research Award (LURA); NSERC Undergraduate Student Research Award (USRA);
Open Positions: 1
Project Description: Modern data systems, such as IBM Db2, have dozens of system configuration parameters, commonly referred to as knobs. These parameters wield significant influence over the performance of business queries. Knobs are responsible for configuring various aspects, including the allocation of working memory, such as the number of pages allocated to the buffer pool and sortheap, the degree of parallelism to be used, and even the toggling of specific features by setting an optimization level. Manual configuration tuning by experts is a labor-intensive and time-consuming process. Consequently, we propose XTune, a reliable and eXplainable, query-informed tuning system. XTune harnesses deep reinforcement learning (DRL) techniques based on actor-critic neural networks, specifically proximal policy optimization (PPO), to tune system configurations. Notably, the PPO policy is considered state-of-the-art by OpenAI, owing to its stability, sample efficiency, and robustness in addressing various reinforcement learning challenges. It computes updates at each step to minimize the loss function while ensuring minimal deviation from the previous policy. The optimization process includes strategies like introducing back pressure to manage resource utilization in cloud computing for sustainability purposes. It begins with the translation of high-dimensional query execution plans (QEPs) into a lower-dimensional space using embeddings derived from Bidirectional Encoder Representations from Transformers (BERT) and Graph Neural Networks (GNN), which then serve as inputs for the DRL models. In the context of large-scale machine learning models, their inherent complexity often renders them as “black boxes,” posing challenges for experts to decipher their prediction processes. The lack of interpretability within predictive models undermines the confidence experts place in these models, particularly in scenarios involving critical decisions, such as data systems tuning.
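To illustrate the constrained policy updates mentioned above, the following is a minimal sketch of PPO's clipped surrogate loss in plain Python. The function name and batch representation (parallel lists of log-probabilities and advantages) are illustrative, not part of XTune itself:

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate loss from PPO, averaged over a batch.

    The probability ratio is clipped to [1 - eps, 1 + eps], so each
    update step cannot move the new policy far from the previous one,
    which is the source of PPO's stability."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)                # pi_new / pi_old
        clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # clip the ratio
        total += min(ratio * adv, clipped * adv)         # pessimistic bound
    return -total / len(advantages)                      # negate: loss to minimize
```

With an unchanged policy (ratio = 1) and advantage 1, the loss is simply -1; with a large ratio and a positive advantage, the clipped term caps the objective at 1 + eps, which is what keeps updates conservative.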
To tackle this issue and cultivate enhanced interpretability within data systems, our research introduces methods to generate saliency and counterfactual explanations, effectively transforming these black boxes into “glass boxes” that offer individuals insights into their internal mechanisms. Our saliency explanation method for tuning system configurations approximates the importance of model features, such as query subplans. Our counterfactual explanations, on the other hand, reveal what would have had to be different in queries and query execution plans (QEPs), in terms of perturbations, to observe a different or desired outcome. To further enhance our approach, we implement an instance-based counterfactual strategy. This strategy outputs similar QEPs from the workload, rather than using arbitrary perturbations, resulting in a diverse tuning outcome. We evaluate our methods over synthetic and real query workloads, quantifying their effectiveness and performance benefits, particularly in the context of data lake-driven workloads. The development of XTune advances the reliability of data systems, while also aligning with the principles of sustainability, resulting in responsible technology usage. Ultimately, the impact of XTune resonates across industries, illustrating how responsible AI can drive positive change.
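The instance-based counterfactual strategy can be sketched as a nearest-neighbour search over the workload's QEP embeddings: rather than perturbing a plan arbitrarily, return the closest real plan that exhibits the desired outcome. The function name, embedding representation (plain vectors), and label scheme below are illustrative assumptions:

```python
def instance_counterfactual(query_vec, workload, target_label):
    """Instance-based counterfactual: among workload QEPs whose tuning
    outcome matches `target_label`, return the embedding closest to
    `query_vec` (Euclidean distance).

    `workload` is a list of (embedding_vector, outcome_label) pairs."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    candidates = [(vec, label) for vec, label in workload
                  if label == target_label]
    if not candidates:
        return None  # no plan in the workload achieves the desired outcome
    return min(candidates, key=lambda c: dist(c[0], query_vec))[0]
```

Because the counterfactual is drawn from plans that actually occurred, the explanation stays realistic: it points at a concrete alternative QEP rather than a synthetic perturbation that may never arise in practice.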
Duties and Responsibilities: Students’ duties and responsibilities will include: reviewing related work in automatic knob tuning for data systems, designing large-scale machine learning-driven approaches to the tuning of configuration parameters, implementing the solution with the deep reinforcement learning model, conducting comprehensive experimental evaluation over synthetic and real-world query workloads, and writing a research paper to be submitted to one of the top-tier conferences in data science, such as VLDB, ACM SIGMOD, IEEE ICDE and EDBT.
Desired Technical Skills: The student should possess algorithmic design and development knowledge, as well as demonstrate strong programming skills.
Desired Course(s): It is recommended to have completed some of the data science courses such as LE/EECS 3405 3.00 – Fundamentals of Machine Learning, LE/EECS 3421 3.00 – Introduction to Database Systems, LE/EECS 4415 3.00 – Big Data Systems, LE/EECS 4411 3.00 – Database Management Systems, LE/EECS 4412 3.00 – Data Mining etc.
Other Desired Qualifications: Other qualifications include good communication skills.
CORAL: COncept-based Explanations for RAG LLMs
Professor: Jarek Szlichta
Contact Info: Email: szlichta@yorku.ca
Position Type: Lassonde Undergraduate Research Award (LURA); NSERC Undergraduate Student Research Award (USRA);
Open Positions: 1
Project Description: Large language models (LLMs) that use retrieval-augmented generation (RAG) are increasingly deployed to answer broad, open-ended questions. However, when a user asks something like “What is the best treatment for a migraine?”, there is no single correct answer. The response depends on what the user means by “best” (fastest relief, fewest side effects, natural alternatives) and on how the model interprets the question through the documents it retrieves (scientific papers, clinical guidelines, online forums, or patient blogs). Current systems provide none of this structure to the user, and so the underlying variability, reasoning, and dependence on intent or interpretation remain opaque. This project will contribute to CORAL, a proof-of-concept tool designed to make such answer variability transparent by organizing the space of possible LLM outputs into semantic concepts. The core idea behind CORAL is to treat answer criteria (user intents) and source interpretations (types of retrieved documents) as concepts that can be combined in different ways. Each combination forms a node in a concept lattice. At each node, the system prompts the LLM using a reworded version of the original question that specifies that node’s intents or interpretations and records the resulting answer. Navigating the lattice allows a user to visualize how answers vary, understand counterfactuals (“what changes if I care about natural remedies rather than clinical effectiveness?”), and explore the minimal adjustments to the prompt or retrieval sources that yield alternative responses. This conceptual perspective differs from prior work that attempts to explain LLMs mechanically (e.g., tracing facts to neurons) or analyses that only examine how RAG sources influence outputs. Instead, CORAL aims to expose meaningful high-level dimensions along which answers differ.
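The lattice construction described above can be sketched as an enumeration of intent subsets crossed with source interpretations, where each node carries a reworded prompt. The function name and prompt template are illustrative assumptions, not CORAL's actual implementation:

```python
from itertools import combinations

def lattice_nodes(intents, interpretations):
    """Enumerate concept-lattice nodes: every non-empty subset of answer
    criteria (intents) paired with every source interpretation.

    Each node yields a reworded prompt that would be sent to the LLM,
    and whose answer would be recorded at that node."""
    for r in range(1, len(intents) + 1):
        for subset in combinations(intents, r):
            for interp in interpretations:
                prompt = (f"Answer assuming the user cares about "
                          f"{' and '.join(subset)}, "
                          f"drawing only on {interp}.")
                yield subset, interp, prompt
```

Even this toy version shows why naive population is expensive: with n intents and m interpretations there are (2^n - 1) * m nodes, each requiring an LLM call, which motivates the pruning strategies below.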
Populating a full lattice naively would require many LLM inference calls, so the project explores pruning strategies that reduce computation without losing important variations. One strategy extracts intent automatically from the model’s own reasoning traces, using graph-of-thoughts outputs to identify which answer criteria matter. Another strategy prompts the model separately with each individual intent, asks it to list and rank plausible answers, and uses rank-aggregation techniques to approximate answers for combinations of intents before selectively querying only those combinations likely to differ. A third strategy uses the model’s reasoning to identify which types of sources are likely to shift the answer, allowing selective exploration of alternative interpretations. A student working on this project will perform research along these directions, experiment with pruning methods, evaluate how well they capture true answer variation, and improve the interactive interface for exploring answer spaces. The work combines LLM prompting, RAG pipelines, data-structure design, concept lattices, and explainability. The outcome will be an improved prototype and demonstration showing how users can more clearly understand the range of plausible answers an LLM might generate and how intent and interpretation shape those answers, with applications to query refinement, medical information exploration, and safer LLM deployment.
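The rank-aggregation pruning strategy can be sketched with a simple Borda count: each per-intent ranking awards points by position, and the aggregate ranking approximates the answer ordering for the combined intents before any further LLM calls are spent. Borda count is one standard aggregation rule; the project may well use a different one:

```python
def borda_aggregate(rankings):
    """Approximate the answer ranking for a combination of intents by
    Borda-count aggregation of the per-intent rankings.

    `rankings` is a list of answer lists, best first; an answer in
    position p of a length-n ranking earns n - p points."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, answer in enumerate(ranking):
            scores[answer] = scores.get(answer, 0) + (n - pos)
    return sorted(scores, key=lambda a: -scores[a])
```

Combinations whose aggregated ranking closely matches a single-intent ranking can then be skipped, and only the combinations likely to produce a genuinely different answer are queried against the model.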
Duties and Responsibilities: Students’ duties and responsibilities will include: reviewing related work in RAG LLMs, designing large-scale machine learning-driven approaches to explainable AI, implementing the solution, conducting comprehensive experimental evaluation over real-world datasets, and writing a research paper to be submitted to one of the top-tier conferences in data science, such as VLDB, ACM SIGMOD, IEEE ICDE and EDBT.
Desired Technical Skills: The student should possess algorithmic design and development knowledge, as well as demonstrate strong programming skills.
Desired Course(s): It is recommended to have completed some of the data science courses such as LE/EECS 3405 3.00 – Fundamentals of Machine Learning, LE/EECS 3421 3.00 – Introduction to Database Systems, LE/EECS 4415 3.00 – Big Data Systems, LE/EECS 4411 3.00 – Database Management Systems, LE/EECS 4412 3.00 – Data Mining etc.
Other Desired Qualifications: Other qualifications include good communication skills.
Human-Computer Interaction in Virtual Reality
Professor: Robert Allison
Contact Info: Email: rallison@yorku.ca
Position Type: Lassonde Undergraduate Research Award (LURA); NSERC Undergraduate Student Research Award (USRA);
Open Positions: 2
Project Description: Students will help design, develop and conduct experiments related to human-computer interaction in virtual environments and digital media. In our lab we have a wide range of apparatus to study human perception in computer-mediated worlds including a new and unique fully immersive virtual environment display.
Duties and Responsibilities: The student would develop interactive 3D virtual worlds and conduct experiments to study self-motion perception, visual perception and human computer interaction in these virtual worlds. In particular, working with a senior graduate student or postdoctoral fellow, the successful applicant would model 3D environments, render them in a virtual reality or other digital media display, develop/implement interaction methods to control and interact with the simulation, and/or develop and run experimental scenarios to investigate these issues with human participants.
Desired Technical Skills: Programming, experimental design, computer graphics, HCI, statistics, presentations and technical writing
Desired Course(s): Programming, computer graphics, HCI, statistics
Other Desired Qualifications: Ability to work well in a team environment.
Using Graphene as a Material for Tritium and Deuterium Separation in Water
Professor: Simone Pisana
Contact Info: Email: pisana@yorku.ca
Position Type: Lassonde Undergraduate Research Award (LURA); NSERC Undergraduate Student Research Award (USRA);
Open Positions: 2
Project Description: Students will help our group’s effort to devise a way to efficiently separate deuterium and tritium in water. Deuterium is a heavy form of hydrogen; it is naturally occurring and is an important resource used in the nuclear, medical, and chemical industries. Tritium is a rare radioactive isotope of hydrogen and is considered a waste product from nuclear generation stations but an element of strategic importance for future fusion power generation programs. Therefore, devising ways to economically separate deuterium- and tritium-containing water molecules from water mixtures is important for clean and economical power generation, among other applications. The work is in partnership with a local company, so there are opportunities to interact and share progress.
Duties and Responsibilities: Depending on the student’s aptitudes and interest, the responsibilities include: isotopic testing of water samples (nuclear magnetic resonance, mass spectrometry or infrared spectroscopy), fabrication of graphene-based water filters, characterization of filter samples, assisting graduate students with their projects, conducting studies and analyzing the results.
Desired Technical Skills: Familiarity with the structure of materials (crystals, defects, bonding), basic chemistry techniques.
Desired Course(s): Introductory courses in materials science (i.e. CHEM 1100 Chemistry and Materials Science for Engineers), chemistry (i.e. CHEM 1001 Chemical Dynamics, CHEM 2011 Introduction to Thermodynamics), physical electronics or solid state physics (i.e. EECS 3610 Semiconductor Physics and Devices), optics (i.e. EECS 4614 Electro-Optics).
Other Desired Qualifications: Great organizational skill, record-keeping, hands-on skill, ability to work well both alone or as part of a team, resourcefulness, creativity, and problem-solving skills.
Privacy Analysis of Large Language Models (LLM)
Professor: Yan Shvartzshnaider
Contact Info: rhythm.lab@yorku.ca
Lab Website: https://www.yorku.ca/lassonde/privacy/
Position Type: Lassonde Undergraduate Research Award (LURA); NSERC Undergraduate Student Research Award (USRA)
Open Positions: 2
Project Description: The rapid shift toward digital platforms has introduced new privacy and security challenges across various sectors, including workplaces, healthcare, and education. These technologies collect and share vast amounts of data about users and their environments. However, our collective understanding of privacy expectations often lags the rapid advancements in technology and information-handling practices. Researchers worldwide are working to develop new methods to systematically analyze the ethical and privacy implications of these tools, aiming to prevent potential societal harm.
These issues are particularly pressing with the emergence of Large Language Models (LLMs), which require training on enormous datasets, raising significant concerns about data privacy.
To address this, the project will explore the ability of LLMs to adhere to context-specific privacy norms, using the theory of Contextual Integrity.
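In Contextual Integrity, an information flow is described by five parameters (sender, recipient, information subject, attribute, and transmission principle), and a flow is appropriate only if it matches an established norm for its context. A minimal sketch of that check, with illustrative field names and an exact-match rule standing in for real norm matching:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    """A contextual-integrity information flow: who sends what kind of
    information about whom to whom, under which transmission principle."""
    sender: str
    recipient: str
    subject: str
    attribute: str
    principle: str

def violates_contextual_integrity(flow, norms):
    """A flow violates contextual integrity when no established norm
    for the context matches all five of its parameters."""
    return not any(flow == norm for norm in norms)
```

Evaluating whether an LLM respects such norms then amounts to checking whether the flows implied by its outputs stay within the set of context-appropriate norms, rather than judging disclosures in isolation.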
Looking for students in 3rd year or higher.
Duties and Responsibilities:
Students will assist in analyzing various LLM models.
Specific tasks include:
1) conducting a comprehensive literature review of existing privacy-related LLM methodologies
2) investigating the model’s properties, such as capacity, alignment, few-shot prompting, and chain-of-thought prompting, with a focus on aligning these models with privacy norms and expectations;
3) contributing to the development of a framework aimed at improving LLM alignment by fine-tuning the models to better align with existing policies and expectations.
See this paper for reference: https://arxiv.org/abs/2409.03735.
Desired Technical Skills:
* Proven programming and data analysis skills
* Experience working with ML and LLM models.
Desired Course(s): Courses on machine learning, LLMs, and data analysis.
Other Desired Qualifications:
* Experience with data analysis using Jupyter and/or R
* Interest in usable privacy, critical analysis of systems, and of privacy-related regulations
* Ability to work independently
Examination of Privacy Practices in Apps and Digital Platforms
Professor: Yan Shvartzshnaider
Contact Info: rhythm.lab@yorku.ca
Lab Website: https://www.yorku.ca/lassonde/privacy/
Position Type: Lassonde Undergraduate Research Award (LURA); NSERC Undergraduate Student Research Award (USRA)
Open Positions: 2
Project Description: Modern sociotechnical systems share and collect vast amounts of information. These systems violate users’ privacy by ignoring the context in which the information is shared and by implementing privacy models that fail to incorporate contextual informational norms. Using techniques in natural language processing, machine learning, network analysis, and data analysis, this project is set to explore the privacy implications of mobile apps, online platforms, and other systems in different social contexts/settings.
To tackle this challenge, the project will operationalize a cutting-edge privacy theory and methodologies to conduct an analysis of existing technologies and design privacy-enhancing tools used in the educational context.
Looking for students in 3rd year or higher.
Duties and Responsibilities: Students will develop privacy-enhancing mechanisms that ensure that information flows in accordance with users’ expectations and established societal norms. Specific tasks include: comprehensive literature review of existing methodologies and tools, analysis of privacy policies and regulations, visualization of information collection practices, and design of a web-based interface for analyzing extracted privacy statements to identify vague, misleading, or incomplete privacy statements.
Desired Technical Skills: Experience with machine learning and natural language processing techniques, good programming skills overall, and experience in using Jupyter and/or R for data analysis.
Desired Course(s): Software engineering, computer science, and information science students. Note: students with diverse backgrounds, including in technical fields, social sciences and humanities are encouraged to apply.
Other Desired Qualifications: HCI design and web development; interest in usable privacy, critical analysis of privacy policies, and privacy-related regulation.