| Day 0: Friday, August 1, 2025 | ||
| Time | Title, abstract, and more info | Presenter(s) |
|---|---|---|
| Virtual | ||
| TBD | A Robust and Informative Application for viewing dataframes in RMore infoIn R programming, the View() function from the utils package provides a basic interface for viewing a dataframe. The current R dataframe interface does not have features such as column selection, complex filtering, variable data types, variable metadata, code reproducibility, and download options. This presentation ([please click on this link to view my draft presentation][1]) will demonstrate a newly created, feature-rich application that includes all of the above features. The application was created using Shiny modules for viewing and examining dataframes from various statistical software such as SAS and Python. [1]: https://docs.google.com/presentation/d/1Mygbx-15iYyd8CVh7sgxtuhtteQ6hFZM/edit?usp=drivesdk&ouid=103296676447663833578&rtpof=true&sd=trueDate and time: Fri, Aug 1, 2025 - TBD Author(s): Madhan Kumar Nagaraji Keyword(s): statistical programming, clinical trials data, dataset interface, workflow Video recording available after conference: ✅ |
Madhan Kumar Nagaraji |
| TBD | A first look at PositronMore infoPositron is a next generation data science IDE built by the creators of RStudio. It has been available for beta testing for a number of months, and R users may have wondered if they should try it or if it will be a good fit for them. This new IDE is an extensible tool built to facilitate exploratory data analysis, reproducible authoring, and publishing data artifacts, and it is an IDE that supports but is not built only for R. How should an R user think about Positron, compared to the other options out there? In this talk, learn about how and why Positron is designed the way it is, what will feel familiar or new coming from other IDEs such as RStudio, and when (or if) people who use R should consider giving it a try. You’ll hear about different choices when it comes to defaults and ways of working, such as how to think about your projects or folders and how to manage multiple versions of R. You will also learn about new functionality for R users and package developers that we have never had before, like new approaches for managing R package tests and the ability to customize an IDE using extensions. If you are curious about Positron and how it fits into the R ecosystem, you’ll come away from this talk with more details about its capabilities and more clarity about whether it may be a good choice for you.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Julia Silge (Posit PBC) Keyword(s): ide, workflow, tooling Video recording available after conference: ✅ |
Julia Silge (Posit PBC) |
| TBD | Analyzing Census Data in R: Techniques and ApplicationsMore infoThis talk provides an introduction to working with IPUMS Census American Community Survey (ACS) data in R, focusing on key techniques for data preparation, weighting, and sampling. Participants will gain a foundational understanding of how Census data is structured and learn how to apply statistical weights to create representative analyses. The talk also explores the role of Census data in artificial intelligence (AI) and machine learning (ML), highlighting its applications in model training, fairness assessments, and demographic insights. Finally, the course addresses the critical use of Census data in anti-discrimination frameworks, demonstrating how demographic techniques can help evaluate bias and promote equitable AI/ML outcomes. Through practical exercises and case studies, participants will develop essential skills for integrating Census data into AI-driven analyses with an emphasis on equity.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Joanne Rodrigues Keyword(s): demography, frameworks, census data, equity ml/ai, anti-discrimination in ml/ai Video recording available after conference: ✅ |
Joanne Rodrigues |
| TBD | Automating workflows with webhooks and plumber in RMore infoWebhooks open up new possibilities for automating workflows, eliminating the need for manual intervention. In this talk, I will demonstrate how you can use plumber, an R package for building APIs, to create a webhook listener that triggers your workflows, such as updating dashboards, retraining machine learning models, and other potential use cases. In the presentation, I will cover how to process payloads (using GitHub webhooks as an example), how to verify authenticity using HMAC signatures, and how to implement logging for tracking script execution, debugging, and monitoring. This talk will benefit researchers, data scientists, and developers who want to make their R workflows responsive to certain triggers. (A minimal illustrative listener sketch follows this day's table.)Date and time: Fri, Aug 1, 2025 - TBD Author(s): CLINTON DAVID Keyword(s): automation, event-driven workflows, plumber api, github webhooks Video recording available after conference: ✅ |
CLINTON DAVID |
| TBD | Beyond Guesswork: How Econometric Models (MMMs) Guide Genius Marketing DecisionsMore infoIn a world where marketing budgets are scrutinised, customer journeys are more fragmented than ever, and every digital channel feels as though it’s taking over the world - how do you really know where to invest? Traditional measurement methods often fall short, leaving marketers with more questions than answers: Is TV a dead channel? Is last-click attribution selling your brand-building efforts short? Are you overinvesting in one channel while under-utilising another? In this talk, we will take you beyond surface-level metrics and into the world of econometrics in R: the gold standard for understanding marketing effectiveness. By applying time-series regression techniques, we can move past vanity metrics and gut feelings to uncover the real impact of marketing spend. Attribution is one of the biggest debates in marketing measurement, therefore, this talk will also explore whether first-click, last-click, or equal weightings really make sense - or whether a more nuanced approach is needed to reflect consumer behaviour. Next, we’ll explore how to quantify return on investment with rigor, determine the optimal allocation of budget across channels, visualise the relationship between channels, and understand diminishing returns to avoid wasted spend. All while breaking down key marketing acronyms, campaign types, and measurement approaches to ensure you leave with a clear understanding of how to apply these concepts in the real world. Finally, we will demonstrate our award-winning (IPA, 2024) example case study built in R with December19 Media Agency, for Xero Accounting UK, to bring our time-series regression analyses to life. If you’re looking for a session that moves beyond generic reporting and 6-figure marketing agency prices – this talk is made for you!Date and time: Fri, Aug 1, 2025 - TBD Author(s): Abbie Brookes (Data Scientist @ Datacove), Jeremy Horne (Director @ Datacove); Abbie Brookes (Data Scientist @ Datacove) Keyword(s): marketing, statistical modelling, econometrics, measurement, regression Video recording available after conference: ✅ |
Abbie Brookes (Data Scientist @ Datacove) Jeremy Horne (Director @ Datacove) |
| TBD | CSV to Parquet: Managing data for multi-language analytics teamsMore infoCSV is arguably the default data storage format for analytics teams. The CSV format is advantageous for its simplicity. When the data size is small, it is easy to inspect the CSV data using a spreadsheet program. However, CSV files tend to become very slow for read and write operations at larger data sizes. Enter Apache Parquet > Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools. From Apache Parquet Documentation The talk will focus on an overview of the Apache Parquet format and its advantages compared to the CSV format. I will also demonstrate reading and writing data in this format in R and show the interoperability with Apache Arrow. Further, I will demonstrate how this format makes life easier for polyglot teams that use R and Python. Finally, the session will end by mentioning key points to consider when deciding on a storage format. Participants will get introduced to Apache Parquet and Arrow and be able to better decide on a storage format for their workflows. Broad Agenda 1. Overview of the Apache Parquet format 2. Benchmarking with CSV 3. Reading and writing data in R and Python 4. Interoperability with Arrow and Pyarrow 5. Conclusion and Takeaways (An illustrative Parquet read/write sketch follows this day's table.)Date and time: Fri, Aug 1, 2025 - TBD Author(s): Viswadutt Poduri Keyword(s): data processing, parquet, analytics, big data, storage Video recording available after conference: ✅ |
Viswadutt Poduri |
| TBD | Data Visualization for Exploratory Factor AnalysisMore infoExploratory factor analysis (EFA) is routinely used by researchers to reduce the dimensionality of data and to form meaningful factors. While there are good guidelines on how to report the results, data visualization tools are rarely used in understanding the results of EFA. Good data visualization, especially in a multivariate framework with ordinal data, makes it easier for people to interpret the results of the analysis. This presentation introduces and demonstrates different data visualization techniques that can be used to illustrate the results of EFA and improve its interpretation. Advantages and disadvantages of each of these techniques are discussed. As EFA is oftentimes used on categorical data from survey research, apart from visualizations for the factor results, exploratory data visualizations for categorical variables are also presented. Moreover, these graphical tools can be used for purposes other than EFA. Data visualization and analysis are performed in R using publicly available survey research data.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Nivedita Bhaktha (Indian Institute of Technology Kanpur) Keyword(s): factor analysis, exploratory data analysis, dimension reduction, ordinal data, survey research Video recording available after conference: ✅ |
Nivedita Bhaktha (Indian Institute of Technology Kanpur) |
| TBD | Don’t Write Code Your Users Haven’t Asked ForMore infoThe fact that your code works doesn't mean it's useful to your users. Ensuring that code works correctly with unit-testing is well-established in R, but validating that we write the correct code—aligned with user needs—remains a challenge. In this talk, we’ll explore how Behavior-Driven Development helps us collaborate with stakeholders by translating their needs into automated tests that check if our software satisfies them. You’ll walk away with an understanding of how to start practicing BDD in R with {cucumber}: a method of producing tests that describe what your software does and that makes tests easier to maintain as your software evolves.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Jakub Sobolewski Keyword(s): testing, behavior-driven development, test-driven development, efficient programming, gherkin Video recording available after conference: ✅ |
Jakub Sobolewski |
| TBD | Experimenting with LLM Applications in RMore infoLarge language models (LLMs) are surprisingly easy to use and at their core, they're just an API call away. But how do you go from calling a model to actually building something useful? In this talk, I'll share my experiences creating and deploying LLM-powered applications in R. I'll walk through different approaches I've tried, from incorporating LLMs into Shiny apps, improving my R code, and experimenting with different models and deployment options. Along the way, I'll highlight what worked, what didn't, and what I learned in the process. Whether you're curious about integrating LLMs into your own projects or just want to see what's possible, this session will offer a practical look at building with LLMs in R without the hype.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Nic Crane Keyword(s): automation, llms, ai Video recording available after conference: ✅ |
Nic Crane |
| TBD | From Data to Narrative: Interactive Storytelling with ShinyMore infoData Storytelling transforms complex datasets into clear, engaging narratives, combining analysis, visualization, and storytelling to inspire action and facilitate decision-making. This session focuses on using Shiny to craft compelling stories through dynamic, interactive applications that turn raw data into impactful insights. Through live demonstrations, attendees will discover how Shiny (via Shiny-live in Quarto) bridges the gap between data and storytelling, empowering developers to create interactive dashboards that communicate complex ideas with clarity and impact. The session will highlight practical examples and best practices for building stories that resonate with diverse audiences. By the end of the session, participants will not only understand how to use Shiny to build interactive dashboards but also how to leverage these tools to create meaningful, audience-focused narratives. Shiny-live will be demonstrated as a key enabler of engaging, visually appealing, and interactive data storytelling.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Francisco Alfaro (USM) Keyword(s): quarto, shiny, data storytelling, interactive dashboards, visualization Video recording available after conference: ✅ |
Francisco Alfaro (USM) |
| TBD | Health care data harmonization using Shiny, clinical experts, and RDBMSMore infoIn support of a large international, multi-site health care project that developed new pediatric sepsis criteria, we created a pipeline to allow clinical experts to harmonize medications, observations, events, and laboratory measurements from electronic medical record extracts. This pipeline was instrumental in allowing the review and use of 2.2 billion rows/175 GB of source data. During the process of developing the sepsis criteria, we received multiple new data deliveries from each site, which required frequent review and re-harmonization of the provided source datasets. This harmonization pipeline consisted of multiple steps, including conflating multiple source row types into one harmonized type, performing source-specific unit mapping, and performing value transformations. In an iterative process, clinical experts would identify rows for mapping, data scientists would run the harmonization pipeline, and then clinical experts would review mapped data using Shiny tools custom built for this project. Due to project and dataset size, we leveraged a range of tools including Google BigQuery, R, and make. After harmonization, the cleaned dataset was approximately 1.7 billion rows/155 GB in size. This large amount of data required special considerations to perform acceptably. To keep Shiny responsive, to keep the server hosting our Shiny apps from crashing, and to prevent client browser crashes, we had to limit data being reviewed to at most a random sample of 50% of the larger data groupings. ![Application Screenshot][1] [1]: https://private-user-images.githubusercontent.com/9376248/421070526-1ced6919-a0f8-4dac-9561-afb442c48161.pngDate and time: Fri, Aug 1, 2025 - TBD Author(s): Seth Russell (University of Colorado Anschutz Medical Campus) Keyword(s): big data, shiny, healthcare, data harmonization, rdbms Video recording available after conference: ✅ |
Seth Russell (University of Colorado Anschutz Medical Campus) |
| TBD | Intracranial Pressure Monitor Placement Prediction in Children with Traumatic Brain InjuryMore infoTraumatic brain injury causes approximately 2,200 deaths and 35,000 hospitalizations in U.S. children annually. Clinicians currently make decisions about placing an intracranial pressure (ICP) monitor in children with traumatic brain injury without the benefit of an accurate clinical decision support tool. In a prospective observational cohort study, we developed and validated models that predict placement of an ICP monitor. Patient data was gathered from multiple sources and discretized into 5-minute intervals. We divided data into four combinations of nurse documented and chart extracted input data, all including patient level and vital sign variables, and with inclusion or exclusion of data from brain computed tomography imaging reports and invasive blood pressure readings. Using R, we built machine learning models using logistic regression, support vector machines, generalized estimating equations, generalized additive models, and LSTMs. We trained each model with each combination of data. Optimal parameters were identified based on the highest F1. The best performing model, an LSTM deep learning model, achieved an F1 of 0.71 within 720 minutes of hospital arrival. The best non-neural network model, standard logistic regression, achieved an F1 of 0.36 within 720 minutes of hospital arrival. While non-RNN models did not achieve the best F1, their coefficient size and direction provide insight into factors predicting ICP monitor placement. Additionally, the generalized additive models allow for visualization and interpretation of the marginal impact (after integrating out the impact of the other variables) of a variable over time.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Seth Russell (University of Colorado Anschutz Medical Campus) Keyword(s): deep learning, machine learning, healthcare, decision making Video recording available after conference: ✅ |
Seth Russell (University of Colorado Anschutz Medical Campus) |
| TBD | Plot Twist: Adding Interactivity to the Elegance of ggplot2 with ggiraphMore infoOne of the most common critiques of ggplot2 is its lack of built-in interactivity. While static plots are powerful for storytelling, interactive visualizations can enhance exploration, engagement, and accessibility. The ggiraph package finally provides a seamless way to add interactivity to ggplot2—enabling hover effects, tooltips, and clickable elements—while preserving the familiar layered approach and custom theming. In this talk, Tanya Shapiro and Cédric Scherer will demonstrate why ggiraph stands out among other solutions, such as plotly, and how it integrates effortlessly with ggplot2 and its extension ecosystem. We’ll walk through real-world examples, explore its key functionalities, and share practical tips for creating engaging and well-designed interactive visualizations with ggiraph. Whether you're looking to make your research more engaging, enhance dashboards, or create interactive reports, this talk will provide a solid foundation for elevating your data storytelling with interactive visualizations.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Cédric Scherer (Independent Contractor), Tanya Shapiro (Independent Contractor) Keyword(s): data visualization, ggplot2, interactive charts, storytelling, dashboard Video recording available after conference: ✅ |
Cédric Scherer (Independent Contractor) Tanya Shapiro (Independent Contractor) |
| TBD | RDepot - 100% open source enterprise management of R and Python repositoriesMore infoRDepot is a solution for the management of R and Python package repositories in an enterprise environment. It allows users to submit packages through a user interface or API and to automatically update and publish R and Python repositories. Multiple departments can manage their own repositories and different users can have different roles in the management of their packages. With continuous integration infrastructure for quality assurance on R and Python packages, package uploads can be automated. All configuration is declarative and RDepot can be set up as infrastructure as code, which is especially relevant in regulated contexts, since it makes validation activities much easier. Packages from publicly available R repositories such as CRAN and Bioconductor can be mirrored selectively in custom repositories for use behind a firewall, in internal networks and offline. Combined with Crane, authentication and fine-grained authorization (using OpenID Connect) can be configured per repository, which offers extra security when dealing with sensitive data or sensitive methodology. In this talk, we will walk R users and developers through different features of RDepot and demonstrate how these can be useful in different scenarios. The logic of the different workflows will be explained and live demos will be given to see the open source solution in action. We will make sure to address needs ranging from small research groups sharing a handful of packages up to multinational companies managing their R (and Python) code across the globe.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Jonas Van Malder Keyword(s): package management, infrastructure, open source Video recording available after conference: ✅ |
Jonas Van Malder |
| TBD | Sharing data science artifacts across teams using CraneMore infoDo you have to share many data science artifacts across teams? This is a problem for many data science organizations and can now be solved using a novel open source product, Crane (https://craneserver.net/). Crane hosts data science artifacts such as data analysis reports, documentation sites, or even packages and libraries. All of these data science artifacts are kept under strict authentication and authorization using modern protocols (OIDC). In this talk, we walk you through the different features of Crane and provide a live demo to explain the concepts. We will discuss its configuration file and demonstrate that authentication in Crane is fully declarative and allows for fine-grained configuration (at user-level, group-level, network-level or using SpEL) while still using an intuitive hierarchical tree that corresponds to the directory structure of the data. Next, we will show how artifacts can be accessed from or uploaded into Crane using the Crane API from R (e.g. to automate report updates, use data science artifacts in CI/CD) or using its customizable UI. Further, we zoom in on audit logs to track operations on all files (e.g. for GxP purposes) and detail the different storage backends (S3 and local file system). To ensure Crane can perform in high security settings, the code base has been tested using integration tests reaching a high code coverage of more than 70%. With this talk we want to teach any R user and developer the essentials of Crane and how it can be used to share their data science artifacts.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Lucianos Lionakis (Open Analytics) Keyword(s): data sharing, data, automation, r, repository management Video recording available after conference: ✅ |
Lucianos Lionakis (Open Analytics) |
| TBD | Shiny Policies: Customised Dashboards to Aid British Government DecisionsMore infoShiny dashboards are a powerful tool for visualising and interacting with data, but without thoughtful design, they can feel generic, clunky, or even inaccessible to key users. In this talk, we will explore how to take Shiny beyond its default appearance to create dashboards that are not only visually appealing but also highly usable, accessible, and seamlessly integrated into an organisation’s digital environment. To demonstrate this to our audience, we will share our open-source dashboard in collaboration with the British Department for Environment, Food and Rural Affairs (DEFRA). While the project required thorough data integration and analysis, one of the biggest challenges was ensuring the dashboard was not just functional but also visually cohesive, highly accessible, and intuitive for a broad range of users—including policymakers with varying levels of data literacy. We’ll start by discussing how to strike the balance between over-simplifying and over-complicating data. Like most open-source data, there is a vast library of data, with little documentation on how to interpret it. Therefore, how to optimise open-source government data underpins this talk – to ensure interactivity and efficient rendering techniques, so we can keep dashboards responsive and user-friendly. Next, we will jump into customisation now that the foundations are in place, looking at how custom CSS and JavaScript can be leveraged to break free from the typical Shiny aesthetic, ensuring dashboards align with existing brand guidelines and user expectations. From typography and colour schemes to interactive elements, we’ll discuss techniques to create a polished, professional design that feels like a natural extension of an organisation’s existing web presence. Accessibility is another key factor in dashboard design. Many users—whether government policymakers, corporate stakeholders, or public audiences—have varying levels of data literacy, and a poorly designed interface can create barriers to insight. We’ll cover strategies for making dashboards more intuitive, including thoughtful navigation structures, tooltips, dynamic summaries, and alternative ways to display data for users with different needs. Additionally, we’ll explore best practices for ensuring compliance with accessibility standards, such as improving contrast, enabling keyboard navigation, and implementing screen reader-friendly elements. By the end of this session, you’ll have a clear understanding of how to design Shiny dashboards that are not just functional but genuinely enjoyable to use - helping your audience engage with data more effectively and make better-informed decisions with open-source data.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Abbie Brookes, Jeremy Horne (Director @ Datacove); Abbie Brookes Keyword(s): shiny apps, dashboard, environmental science, health science, decision-making, customisation Video recording available after conference: ✅ |
Abbie Brookes Jeremy Horne (Director @ Datacove) |
| TBD | ShinyProxy: easily deploy your Shiny appsMore infoShinyProxy (https://shinyproxy.io/) is a 100% open source framework to deploy Shiny (and other) apps or web-based IDEs (like RStudio). Because of its flexibility, ShinyProxy is being used by both small startups and large enterprises. Although ShinyProxy was originally tailored towards hosting Shiny apps, it can host virtually any web app. Since ShinyProxy makes it easy to make reproducible apps, even when using multiple R versions, it's often used by pharmaceutical companies. Nevertheless, it's used by financial and engineering companies as well. ShinyProxy seamlessly integrates with your existing infrastructure (such as authentication providers and databases). The purpose of this talk is to give an introduction to ShinyProxy, explain the use-cases of ShinyProxy, and highlight its unique advantages over other solutions. No deep technical knowledge (e.g. of Docker or Linux) is needed to follow this talk; however, the talk will give you enough information to start using ShinyProxy yourselves. As usual, the development of ShinyProxy has continued; therefore, we'll also give a preview of upcoming features.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Tobia De Koninck (Open Analytics NV) Keyword(s): shiny, automation, docker, webapp Video recording available after conference: ✅ |
Tobia De Koninck (Open Analytics NV) |
| TBD | System Design for Shiny Developers: The Comprehensive Deployment ArchitectureMore infoThe presentation discusses the aspects of application development and deployment that span beyond the Shiny application itself: data storage and access, user management and authentication, observability and telemetry, multi-lingual microservices for complex task delegation, caching, and more. All these infrastructural elements can be created and used with free and open-source software such as the Docker engine, PostgreSQL, OpenLDAP, ShinyProxy, the R language, and various R packages. The entire system of all services communicating with each other is facilitated by Docker-Compose and can be mapped on a single diagram. The diagram is presented during the talk to provide a clear understanding and a high-level overview of system design concepts. The author will also present practical examples, guidelines, and tips on how to design and ship a complete solution from scratch. After the talk, the audience can expect virtual handout materials provided via a GitHub repository, which can be used as a starting template for their own projects.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Pavel Demin (Appsilon) Keyword(s): shiny, shinyproxy, system design, microservices, docker Video recording available after conference: ✅ |
Pavel Demin (Appsilon) |
| TBD | The Future of Asynchronous Programming in RMore infoAsynchronous programming can be a powerful paradigm, whereby computations are allowed to run concurrently without blocking the main session. It is an opportune time to survey the current landscape, as R infrastructure in this respect has matured significantly over recent years. Instead of running a script sequentially from top to bottom, logic that takes a long or unpredictable amount of time to complete may be offloaded to different R processes, possibly on other computers or in the cloud. In the meantime, the main session may be running constantly and non-interactively, performing operations in real time, synchronizing with these tasks only when necessary. This style of programming requires a very specific set of tooling. At the very base, there is an infrastructure layer involving key enabling packages such as later and mirai. It will be explained at a high level why these two packages together currently offer the most complete and efficient implementation of async for the R language. There are further tools which expand async functionality to cover specific needs, such as the watcher package for filesystem monitoring. There are then a range of tools built on top of these, bringing async capabilities to the end-user, such as the httr2 package for querying APIs and the ellmer package for interacting with LLMs. In addition to these existing tools, exciting developments in asynchronous programming are just around the corner. These will be previewed, together with speculation on what might be possible at some point in the future.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Charlie Gao (Posit PBC) Keyword(s): asynchronous programming, distributed computing, parallel computing, open source tools Video recording available after conference: ✅ |
Charlie Gao (Posit PBC) |
| TBD | Thinking Inside the {box}: A Structured Approach for Full-Stack App DevelopmentMore infoAs Shiny applications scale, maintaining clean structure, managing dependencies, and ensuring long-term maintainability become increasingly challenging. The {box} package modernizes R’s approach to modularization, while {rhino} provides a structured framework for building robust Shiny apps. Together, they offer a structured and scalable workflow for Shiny development. In this talk we will explore how to leverage {box}'s modularity also for API development, using a structured approach to manage routers, endpoints, filters and error handlers. This workflow takes advantage of the programmatic usage of {plumber}, as an alternative for the annotation-based approach. To understand these concepts in a real-world scenario, this talk will present a case study of a Shiny application that integrates {box} for modular design and {plumber} for structured API development. We will walk through key architectural decisions, demonstrate how modularization improves maintainability, and explore how this approach streamlines both Shiny and API development. This will help attendees gain actionable insights they can apply to their own projects.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Samuel Enrique Calderon Serrano (Appsilon, R Shiny developer) Keyword(s): modules, shiny, box, api, production Video recording available after conference: ✅ |
Samuel Enrique Calderon Serrano (Appsilon R Shiny developer) |
| TBD | Transforming Public Health Data Management: From Individual Use to Scalable Workflows with RMore infoThe Health Information and Statistics Office within the Ministry of Health of Buenos Aires, Argentina, faced some key and unexpected challenges in its first year as an organization. As a small, 10-person team formed in 2019 building slow-paced data products with self-imposed goals, such as dashboards, they were hit with difficult tasks to be performed under pressure such as managing information and statistics workflows during a pandemic for a city with 3M inhabitants and serving hundreds of physicians and decision-makers of the public sector with almost real-time information. Over the years, this interdisciplinary team has tripled in size and has played a key role in high-impact strategic data science projects. These include developing data science solutions for extracting information from free-text data, creating complex algorithms for processing data from the city's Electronic Health Records, and implementing large-scale cost recovery initiatives in the healthcare system by cross-referencing massive datasets and generating +35k rendered documents per week shared with key agencies in the city. To fulfill these objectives, the team has built a robust infrastructure and a wide range of digital products—all within the R ecosystem. The talk will cover the strategies, tools, and lessons learned in building efficient and reproducible data workflows in the public sector in a context of very limited resources, and we’ll explore how R has been fundamental in transitioning from individual analyses to scalable, automated workflows.Date and time: Fri, Aug 1, 2025 - TBD Author(s): María Cristina Nanton (University of Buenos Aires) Keyword(s): public sector, data science, data mining, workflows, city management Video recording available after conference: ✅ |
María Cristina Nanton (University of Buenos Aires) |
| TBD | pkgdocs: a modular R package site generatorMore infopkgdocs is a new R package to generate package documentation as markdown from an R source package. Compared to other tools, pkgdocs is not focused on generating a static website directly, but rather pages that can be included in a larger documentation site. A common pattern in big projects is to modularize development in several R packages. By just generating markdown and not a finished static site, combining documentation of multiple packages is made easier. pkgdocs was made to work well with Hugo and the Docsy theme, but the markdown output should also be usable with other markdown-based static site generators with minor changes.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Daan Seynaeve; Anne-Katrin Hess Keyword(s): package, documentation, markdown, static site generator Video recording available after conference: ✅ |
Daan Seynaeve |
| TBD | rosella: diagnosing and interpreting classification modelsMore infoUnderstanding the behavior of complex machine learning models has become a challenge in the modern day. Explainable AI (XAI) methods were introduced to provide insights into model predictions; however, interpreting these explanations can be difficult without proper visualisation methods. To fill this gap, we have built rosella, an R package offering an interactive Shiny app that visualizes model behavior in the data space alongside XAI explanations. Designed for developers, educators, and students, rosella makes model decisions more accessible and interpretable.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Janith Wanniarachchi (Monash University); Dianne Cook (Monash University), Kate Saunders (Monash University), Patricia Menendez (University of Melbourne), Thiyanga Talagala (University of Sri Jayewardenepura) Keyword(s): high dimensional data, explainable ai, interactive tools, machine learning Video recording available after conference: ✅ |
Janith Wanniarachchi (Monash University) |
| Virtual Lightning | ||
| TBD | Gen AI-Powered Shiny Dashboard for Financial CollectionsMore infoTracking collections performance at a granular level is crucial for financial institutions. Our Shiny-based Collection Dashboard, powered by Gen AI, transforms the way business teams interact with data. The dashboard monitors key metrics like Bounce rate, First EMI Bounce, Current resolution, etc., with multi-level filtering by zone, state, region, and branch. To enhance usability, we introduced: - Automated PPT Generation: Users can download a fully customized PowerPoint presentation for any combination of filters. The charts are further enhanced with summaries and actionable items for business by an LLM, providing key takeaways. - "Talk to Your Data" (Text2SQL): Business teams can query the data in natural language—e.g., “Which zone had the highest bounce rate this month?”—and receive instant, downloadable reports. By integrating Gen AI, we’ve significantly reduced business teams' dependency on analytics for day-to-day data needs, empowering them with self-serve insights at scale.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Arnav Chauhan (Cholamandalam Investment and Finance Company Ltd.), Sreeram R (Cholamandalam Investment and Finance Company Ltd); Arnav Chauhan (Cholamandalam Investment and Finance Company Ltd.) Keyword(s): gen ai, r shiny, financial data, ppt generation, text2sql Video recording available after conference: ❌ |
Arnav Chauhan (Cholamandalam Investment and Finance Company Ltd.) Sreeram R (Cholamandalam Investment and Finance Company Ltd) |
| TBD | Exploring Fun and Functional R PackagesMore infoEveryone can use a bit of fun to improve our coding experience. While R is widely used for statistical analysis, it also has a creative and playful side. In this session, we’ll explore around 20 fun packages. Attendees will learn how to use packages like memer to create memes, emojifont to insert emojis into plots, wordcloud2 to generate interactive word clouds... By the end of the session, attendees will walk away with fresh ideas for integrating these tools into their daily workflows, whether for personal enjoyment or to create more engaging, impactful data visualizations. Preferred format: Lightning talk, open to talks or postersDate and time: Fri, Aug 1, 2025 - TBD Author(s): Joanna Chen (TikTok) Keyword(s): r packages, r for fun Video recording available after conference: ❌ |
Joanna Chen (TikTok) |
| TBD | Extending Shiny with React.js: Interactive Bubble Charts with nivo.bubblechartMore infoThe nivo.bubblechart package is an R interface to the nivo.rocks library, designed for creating interactive bubble charts in Shiny applications. Built on top of React.js and D3.js, nivo.rocks provides powerful and customizable visualizations that go beyond traditional R plotting libraries. This talk will demonstrate how nivo.bubblechart leverages the reactR package to seamlessly extend Shiny with React components, enabling highly interactive, dynamic, and responsive visualizations. The audience will gain insights into how reactR bridges the gap between R and JavaScript, allowing developers to integrate modern web technologies into their R applications. Through live examples and code snippets, this session will highlight the advantages of using React-powered widgets in Shiny and how they can enhance user experience with interactive graphics. Whether you're an R developer exploring JavaScript or a Shiny user looking to extend your UI capabilities, this talk will provide practical takeaways to level up your Shiny dashboards. GitHub URL: https://github.com/DataRacerEdu/nivo.bubblechartDate and time: Fri, Aug 1, 2025 - TBD Author(s): Anastasiia Kostiv Keyword(s): react.js shiny d3.js Video recording available after conference: ✅ |
Anastasiia Kostiv |
| TBD | GeoLink R packageMore infoGeoLink is an R package that assists users with merging publicly available geospatial indicators with georeferenced survey data. The georeferenced survey data can contain either latitude and longitude geocoordinates, or an administrative identifier with a corresponding shapefile. The procedure involves: Downloading geospatial indicator data, Shapefile tessellation, Computing Zonal statistics, and spatial joining of geospatial data with unit level data. The package, for example, can be used to link household characteristics measured in surveys with satellite-derived measures such as the average radiance of night-time light. The package can also calculate indicator values for each pixel covered by a tessellated grid in which a household is located. Finally, the package can be used to calculate zonal statistics for a user-defined shapefile (at native resolution or tessellated) and link the results to survey data. GeoLink complements the povmap and EMDI R packages to facilitate small area estimation with geospatial indicators. The latter two packages enable the estimation of regionally disaggregated indicators using small area estimation methods and includes tools for processing, assessing, and presenting the results.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Christopher Lloyd (University of Southampton (WorldPop)); Luciano Perfetti-Villa (University of Southampton (School of Geography and Environmental Science)) Keyword(s): social statistics, geospatial indicator, administrative unit, household survey, poverty mapping Video recording available after conference: ✅ |
Christopher Lloyd (University of Southampton (WorldPop)) |
| TBD | R-evealing Insights: Forecasting Demand and Visualizing Data for Optimal Dermatology Clinic OperationsMore infoIntroduction: Efficiently managing patient demand and resources is crucial in dermatology, especially in India, where the doctor-to-patient ratio varies significantly, with a national average of approximately 1:834. This presentation explores using machine learning algorithms to forecast demand in dermatological clinics and developing an interactive visualization platform using ggplot2 in R. The goal is to help clinics in India and beyond optimize operations, improve patient satisfaction, and enhance resource allocation. Methods: We employ machine learning algorithms, including time series analysis and regression models, to analyze historical patient data. These algorithms identify trends and seasonal variations, enabling accurate demand forecasting. Additionally, we develop an interactive visualization platform using ggplot2 in R. This platform provides intuitive visualizations of clinic data, such as the busiest days, main types of cases, and other critical metrics. It also includes scenario testing features to simulate various staffing and resource allocation strategies. Results: The machine learning models successfully predict demand patterns, allowing clinics to anticipate busy periods and allocate resources effectively. The ggplot2-based visualization platform offers dynamic and customizable charts, making it easy for dermatologists to understand their clinic's data. The scenario testing feature enables clinics to visualize the impact of different staffing and resource allocation strategies, facilitating data-driven decision-making. Conclusion: Combining machine learning forecasts with interactive visualizations empowers dermatological clinics to enhance efficiency, improve patient care, and manage resources effectively. This holistic approach ensures that clinics are well-prepared to meet patient demand, optimize operations, and deliver superior care.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Anjali Ancy; Subitchan . (SAVEETHA INSTITUTE OF MEDICAL AND TECHNICAL SCIENCES), Monisha M (SAVEETHA INSTITUTE OF MEDICAL AND TECHNICAL SCIENCES) Keyword(s): data visualisation, demand forecasting, data visualisation, ai in health, increasing patient involvement Video recording available after conference: ✅ |
Anjali Ancy |
| TBD | Shiny AI Regression and Prediction: Integration R Shiny with gemini.R PackageMore infoStarting from the limitations of several statistical applications and the vast AI environment in responding to user prompts, the Shiny AI Regression and Prediction (SHARP) application was developed to build statistical applications, particularly in linear regression modeling, with automatic result interpretation using AI-generated prompts. The broad AI response environment is restricted with specific commands to ensure more controlled outputs. The development of this application utilizes two main packages: shiny and gemini.R. Additionally, several supporting packages, including readxl, ggplot2, olsrr, and reshape2, are used for data import, visualization, and modeling. Finally, the application is deployed using the shinyapps.io platform. Links Poster: https://drive.google.com/file/d/1LX85iqVOB1sKExLYLrdDf7UKgsEGiwDQ/view?usp=sharing Demonstration: https://drive.google.com/file/d/1gNrnEHW8--acgYRq0Ukl3UgVRhcHfNbM/view?usp=sharing Datasets: https://drive.google.com/drive/folders/1ClN-B8xKOc3y-AeT7KpsPUWDg9VkwFd9?usp=sharing Application: https://bqhcpg-joko0ade-nursiyono.shinyapps.io/Sharp/Date and time: Fri, Aug 1, 2025 - TBD Author(s): Joko Ade Nursiyono (BPS - Statistics of East Java, Indonesia) Keyword(s): statistical application, data science, data mining, shiny, ai, data, deploy, application, automation, insight Video recording available after conference: ✅ |
Joko Ade Nursiyono (BPS - Statistics of East Java Indonesia) |
| TBD | The wrong ways to run code in RMore infoSome of the features of R can be misused to do very confusing if not outright misleading things. We are going to explore a few of them, showing how they are used normally and how they are not intended to be used, in a style borrowing from Wat by Gary Bernhardt: - Some S3 classes store functions inside the objects and call them from their own methods, akin to having a virtual method table in C++. By changing the stored function, we can make print(x) perform arbitrary actions. - Source references are attributes that link executable objects to their source code. An invalid source reference will obscure the real source code of a function, making it look as if it does something different. - Use of lazy evaluation and dynamic bindings can make variable access execute code. - In addition to the normal evaluator that interprets the LANGSXP syntax trees, R contains a bytecode evaluator, which is faster. If the bytecode and the normal body of a function disagree in important ways, the results can be very baffling. (A toy illustration of the first trick follows this day's table.)Date and time: Fri, Aug 1, 2025 - TBD Author(s): Ivan Krylov (Lomonosov Moscow State University) Keyword(s): serialization, evaluation Video recording available after conference: ✅ |
Ivan Krylov (Lomonosov Moscow State University) |
| TBD | quickr: Translate R to Fortran for Improved PerformanceMore infoThis talk introduces 'quickr', an R package designed to make numerical R code faster by translating R functions to Fortran. While R code offers great flexibility, it often comes at the expense of performance, especially for computationally intensive tasks. To achieve better speed, users typically need to rewrite performance-critical code in compiled languages like C or Fortran, which adds complexity and creates maintenance overhead. Quickr simplifies this process by allowing users to add simple type declarations to their existing R functions, which enables quickr to then automatically translate the entire function into efficient Fortran routines. The presentation will demonstrate quickr in practical applications, with benchmarks showing performance improvements comparable to native C implementations. The talk will also cover current limitations, including supported data types and language features, and show how quickr can be easily integrated into existing R packages. Participants will learn how quickr can help improve their R code performance without significantly increasing development complexity or sacrificing the readability of their code.Date and time: Fri, Aug 1, 2025 - TBD Author(s): Tomasz Kalinowski (Posit PBC) Keyword(s): speed, hpc, numerical computing, type annotation, r syntax, declare(), fortran Video recording available after conference: ✅ |
Tomasz Kalinowski (Posit PBC) |
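
The webhook automation talk above describes using plumber to receive GitHub webhooks, verify their HMAC signatures, and trigger downstream work. As a rough, hypothetical sketch of that pattern (the route name, environment variable, and payload fields are illustrative assumptions, not the speaker's code):

```r
# plumber.R -- hypothetical webhook listener sketch
library(plumber)

#* Receive a GitHub webhook and verify its HMAC-SHA256 signature
#* @post /webhook
function(req, res) {
  payload <- req$postBody
  sent_sig <- req$HTTP_X_HUB_SIGNATURE_256            # GitHub sends "sha256=<hex digest>"
  expected <- paste0(
    "sha256=",
    digest::hmac(Sys.getenv("GITHUB_WEBHOOK_SECRET"), payload, algo = "sha256")
  )
  if (!identical(sent_sig, expected)) {
    res$status <- 401
    return(list(error = "invalid signature"))
  }
  event <- jsonlite::fromJSON(payload)
  # Trigger downstream work here: refresh a dashboard, kick off retraining, ...
  message("Verified event from ", event$repository$full_name)
  list(status = "ok")
}
```

Running `plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)` and pointing the GitHub webhook at the exposed URL would complete the loop.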
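
The CSV-to-Parquet talk above demonstrates reading and writing Parquet from R via Apache Arrow. A minimal sketch of that workflow (file and directory names are placeholders, not the speaker's materials):

```r
library(arrow)
library(dplyr)

# Write a data frame to a compressed, columnar Parquet file and read it back
write_parquet(mtcars, "mtcars.parquet")
df <- read_parquet("mtcars.parquet")

# For larger-than-memory data, write and scan a directory of Parquet files;
# dplyr verbs are pushed down to Arrow and only evaluated at collect()
write_dataset(mtcars, "mtcars_ds", format = "parquet")
open_dataset("mtcars_ds", format = "parquet") |>
  filter(cyl == 6) |>
  summarise(mean_mpg = mean(mpg)) |>
  collect()
```

The same Parquet file can then be opened from Python with pyarrow or pandas, which is the interoperability point the talk highlights for polyglot teams.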
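
The "wrong ways to run code in R" talk above opens with S3 objects that store functions and call them from their own methods. A toy illustration of the idea (a hypothetical class, not the speaker's example):

```r
# A class whose print method delegates to a function stored inside the object
new_reporter <- function(value) {
  structure(
    list(value = value, show = function(v) cat("value:", v, "\n")),
    class = "reporter"
  )
}
print.reporter <- function(x, ...) {
  x$show(x$value)   # the behaviour of print() lives in the object itself
  invisible(x)
}

r <- new_reporter(42)
print(r)            # value: 42

# Swapping the stored function silently changes what print() appears to do
r$show <- function(v) cat("something else entirely\n")
print(r)            # something else entirely
```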
useR! 2025 conference program
but make it Python…🐍💅
Just like the official useR! program, this one is also made with Quarto, but with Python and great_tables instead of R and gt.
Source code for the Python version is at https://github.com/mine-cetinkaya-rundel/pydytuesday-useR2025, developed with help from ChatGPT in this thread.
| Day 1: Friday, August 8, 2025 | |||
| Time | Room | Title, abstract, and more info | Presenter(s) |
|---|---|---|---|
| Tutorial | |||
| 08:30–12:00 | TBD | Causal Machine Learning in RMore infoIn both data science and academic research, prediction modeling is often not enough; many questions need to be approached causally. However, we can augment and improve causal inferences using machine learning techniques. In this workshop, we’ll teach the essential elements of combining machine learning and causal techniques to answer causal questions in R. We’ll cover causal diagrams and doubly robust causal modeling techniques that allow for valid inferences with ML models via targeted maximum likelihood estimation (TMLE). We’ll also show that we can better take advantage of both tools by distinguishing predictive models from causal models. This workshop assumes you have a basic understanding of prediction modeling and R.Learning goals: * Understand why prediction models can't answer causal questions * Understand how causal diagrams allow us to improve causal queries and how to use them in R * Develop doubly robust models in R to answer causal questions using machine learning techniques Target audience: Intermediate to advanced; we assume basic experience with R and machine learning. Previous experience with causal inference will be helpful but not required. Interested users can consult our book ahead of time. Prerequisites: None Date and time: Fri, Aug 8, 2025 - 08:30–12:00 Author(s): Malcolm Barrett (Stanford University) Keyword(s): causal inference, tmle, machine learning Video recording available after conference: ❌ |
Malcolm Barrett (Stanford University) |
| 08:30–12:00 | TBD | Debugging Tools for Functions in RMore infoIf you write functions but are unsure of efficient strategies to identify the source of errors then join this workshop to unlock your programming superpower with debugging techniques! In this workshop, we will review code troubleshooting tips, discuss debugging functions (traceback(), browser(), debug(), trace(), and recover()), and distinguish between strategies for debugging your own code versus someone else’s code.Learning goals: 1. Review code troubleshooting tips. 2. Apply debugging functions (traceback(), browser(), debug(), trace(), and recover()) and identify the additional benefits of employing some of these strategies within RStudio. 3. Distinguish between strategies for debugging your own code versus someone else’s code. Target audience: Individuals with experience writing functions but new to debugging. Prerequisites: None Date and time: Fri, Aug 8, 2025 - 08:30–12:00 Author(s): E. David Aja (Posit PBC), Shannon Pileggi (The Prostate Cancer Clinical Trials Consortium) Keyword(s): debugging, functions Video recording available after conference: ❌ |
E. David Aja (Posit PBC) Shannon Pileggi (The Prostate Cancer Clinical Trials Consortium) |
| 08:30–12:00 | TBD | From Model to Meaning: How to use the marginaleffects package to interpret results from statistical or machine learning modelsMore infoOur world is complex. To make sense of it, data analysts routinely fit sophisticated statistical or machine learning models. Interpreting the results produced by such models can be challenging, and researchers often struggle to communicate their findings to colleagues and stakeholders. This tutorial is designed to bridge that gap. It offers a practical guide to model interpretation for analysts who wish to communicate their results in a clear and impactful way. Tutorial attendees will be introduced to the marginaleffects package and to the conceptual framework that underpins it. The marginaleffects package for R offers a single point of entry for computing and plotting predictions, counterfactual comparisons, slopes, and hypothesis tests for over 100 different types of models. The package provides a simple and unified interface, is well-documented with extensive tutorials, and is model-agnostic—ensuring that users can extract meaningful quantities regardless of the modeling framework they use. The book Model to Meaning: How to Interpret Statistical Results Using marginaleffects for R (forthcoming with CRC Chapman & Hall) introduces a powerful conceptual framework to help analysts make sense of complex models. It demonstrates how to extract meaningful quantities from model outputs and communicate findings effectively using marginaleffects. This tutorial will provide participants with a deep understanding of how to use marginaleffects to improve model interpretation. Attendees will learn how to compute and visualize key statistical summaries, including marginal means, contrasts, and slopes, and how to leverage marginaleffects for hypothesis and equivalence testing. The package follows tidy principles, ensuring that results integrate seamlessly with workflows in R, and with other packages such as ggplot2, Quarto, and modelsummary. This tutorial is suitable for data scientists, researchers, analysts, and students who fit statistical models in R and seek an easy, reliable, and transparent approach to model interpretation. No advanced mathematical background is required, but familiarity with generalized linear models like logistic regression is assumed. (A brief illustrative marginaleffects sketch appears at the end of this program.)Learning goals: None Target audience: None Prerequisites: None Date and time: Fri, Aug 8, 2025 - 08:30–12:00 Author(s): Vincent Arel-Bundock (Université de Montréal) Keyword(s): model interpretation, statistical analysis, marginaleffects, regression modeling, causal inference Video recording available after conference: ❌ |
Vincent Arel-Bundock (Université de Montréal) |
| 08:30–12:00 | TBD | Tidy manipulation of genomic dataMore infotidyomics is an open source project to enable a tidy data analysis framework for omics data, such as single cell gene expression, genomic annotation, chromatin interactions, and more. tidyomics enables the use of familiar tidyverse verbs (select, filter, mutate, etc.) to manipulate rich data objects in the R/Bioconductor ecosystem. In this workshop, we will give a high level overview of the project, and then work through a number of examples involving experimental datasets and typical bioinformatics tasks, showing how these can be cast as tidy data analyses.Learning goals: * Basic tidy operations on experimental data and genome annotation * Bulk and single cell expression with tidy manipulation and visualization * Examples of how to integrate diverse genomic datasets (ChIP-seq and RNA-seq) Target audience: Bioinformatics audience at any level, some knowledge of dplyr is helpful Prerequisites: None Date and time: Fri, Aug 8, 2025 - 08:30–12:00 Author(s): Justin Landis (UNC), Michael Love (UNC-Chapel Hill) Keyword(s): tidy data, genomics, bioinformatics, bioconductor Video recording available after conference: ❌ |
Justin Landis (UNC) Michael Love (UNC-Chapel Hill) |
| 13:00–16:30 | TBD | Complex Survey Data Analysis: A Tidy Introduction with {srvyr} and {survey}More infoThis interactive tutorial will introduce how to conduct analysis of survey data in R. We will first introduce a unifying workflow of tidy survey analysis in R for analysis of survey microdata with weights. We will cover topics of descriptive analysis, including functions to obtain weighted proportions, means, quantiles, and correlations from survey data. Then, we will discuss some statistical testing, including t-tests for comparing means and chi-squared tests for comparing proportions. Finally, we will discuss common probability sampling designs and how to create the survey design objects in R to account for the sampling design. The tutorial will include time for exercises using data from the 2020 American National Election Study and the 2020 Residential Energy Consumption Survey, so you can get hands-on experience with the functions. We will be using Posit Cloud, so you do not need to have R or RStudio preinstalled on your computer. For the best learning experience, we recommend you have some prior experience with R and the tidyverse, including familiarity with mutate, summarize, count, and group_by. (A short illustrative srvyr sketch appears at the end of this program.)Learning goals: 1. Interpret documentation accompanying survey data and set up a survey design object in R. 2. Calculate weighted means, quantiles, proportions, and correlations along with standard errors and confidence intervals for survey data. 3. Specify t-tests for continuous survey data and understand the difference between two-sample t-tests and paired t-tests. Along the way, implement “dot” notation for passing design objects into the test function. 4. Conduct goodness of fit tests, tests of independence, and tests of homogeneity for categorical survey data. Target audience: Analysts who want to analyze survey microdata (record-level data) with weights and disseminate results. May already use another language for survey analysis or are just starting out in survey analysis. Prerequisites: None Date and time: Fri, Aug 8, 2025 - 13:00–16:30 Author(s): Rebecca Powell (Fors Marsh), Isabella Velásquez (Posit PBC), Stephanie Zimmer (RTI International) Keyword(s): survey analysis, statistical testing, weighted analysis Video recording available after conference: ❌ |
Rebecca Powell (Fors Marsh) Isabella Velásquez (Posit PBC) Stephanie Zimmer (RTI International) |
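As a small taste of the tidy survey workflow described above (not the tutorial's actual exercises), the sketch below builds a toy weighted design with srvyr and computes a weighted mean and proportion; the variables and weights are invented.

```r
library(srvyr)
library(dplyr)

# Toy microdata with a hypothetical survey weight column
dat <- tibble(
  age    = c(25, 40, 63, 31, 58),
  voted  = c(1, 1, 0, 1, 0),
  weight = c(1.2, 0.8, 1.5, 1.0, 0.9)
)

# Declare the design: here, weights only; strata/clusters would be added for real data
des <- dat %>% as_survey_design(weights = weight)

# Weighted estimates with standard errors
des %>%
  summarize(
    mean_age  = survey_mean(age),
    prop_vote = survey_mean(voted)
  )
```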
| 13:00–16:30 | TBD | Getting Started with Positron: A Next-Generation IDE for data scienceMore infoPositron is a next-generation data science IDE built by Posit PBC that combines the best features of RStudio and Visual Studio Code. This tutorial will introduce R users to Positron's core capabilities, with a special emphasis on helping RStudio users while highlighting its seamless integration with the R ecosystem. For R programmers coming from RStudio, Positron delivers a familiar yet enhanced environment for data analysis and package development, while offering a path to Python when needed. Unlike traditional software-development oriented IDEs, Positron provides first-class support for data science-specific workflows through its native support for R (via the Ark kernel), along with designated areas for Variables (Environment), Connections, Plots, Help, and more that RStudio users have come to rely on. During this hands-on tutorial, participants will learn how to: - Install, configure, and update Positron for an R-focused workflow - Navigate Positron's interface and understand how it compares to RStudio - Use Positron's advanced features for interactive R coding and data exploration - Customize Positron with useful settings, extensions, and keyboard shortcuts - Implement a project-based workflow with a workspace We'll explore Positron's innovative features that enhance R productivity, such as the improved interactive Data Explorer and the ability to switch between different R installations. The tutorial will include practical demonstrations of key workflows, such as developing and publishing Shiny apps and Quarto documents, package development, and data visualization. For R users curious about Python, we'll briefly demonstrate how Positron makes Python accessible within a familiar environment. We'll also briefly cover compatibility with VS Code extensions relevant to R users and how to leverage them through the Open VSX Registry. We’ll survey various ways to access GenAI for coding assistance and a high-level overview of more specialized topics, such as remote development and integrations with cloud providers. By the end of this tutorial, participants should understand how to accomplish their most-used workflows in Positron and how to tailor the IDE to their specific needs.Learning goals: Participants will learn how to set up Positron for their data science work and navigate its interface. They will gain practical experience with Positron's data-focused tools including the Variables pane, Data Explorer, and Plot viewer. By the end of the tutorial, participants will be able to customize Positron to maintain their familiar RStudio workflow patterns while gaining access to new capabilities and understand how to effectively transition their R-based data science projects to this new environment. Target audience: This tutorial is designed specifically for R users familiar with the RStudio IDE who want to explore Positron as an alternative or additional tool. It's particularly valuable for R programmers who occasionally need to use Python or other languages (Rust, C++, Lua, etc), or who collaborate with Python users. Both experienced R users looking to expand their toolset and RStudio enthusiasts curious about the next generation of R development environments will benefit from this session. 
Prerequisites: None Date and time: Fri, Aug 8, 2025 - 13:00–16:30 Author(s): Jennifer Bryan (Posit PBC), Julia Silge (Posit PBC) Keyword(s): positron, r, ide, data science, rstudio Video recording available after conference: ❌ |
Jennifer Bryan (Posit PBC) Julia Silge (Posit PBC) |
| 13:00–16:30 | TBD | R You Out of Memory Again? Level Up Your Data Game with Arrow and DuckDBMore info"I can't analyze this dataset—R keeps running out of memory!" This common frustration signals a critical gap in the R analyst's toolkit. This hands-on tutorial empowers tidyverse users to break through memory limitations by leveraging two game-changing technologies: Apache Arrow and DuckDB. Arrow provides a cross-language columnar memory format that enables efficient processing of large datasets without full memory loading, while DuckDB offers an embeddable analytical database engine that excels at complex aggregations and joins. When combined with dplyr's grammar of data manipulation, these tools create a powerful framework for scalable data analysis. The beauty of this approach? You can keep using the dplyr syntax you already know and love. Arrow and DuckDB work seamlessly with dplyr's grammar of data manipulation, translating familiar verbs into high-performance operations that process data outside of R's memory constraints. This means analyzing gigabytes of data on your laptop without rewriting your existing code or learning entirely new frameworks. Through practical examples with real-world datasets, participants will discover how to: - Transform existing dplyr pipelines to process larger-than-memory datasets - Navigate the complementary strengths of Arrow (streaming operations, columnar processing) and DuckDB (complex aggregations, efficient joins) - Integrate SQL when needed for specialized operations - Optimize query performance through execution strategies like predicate pushdown and parallel processing We'll focus on immediately applicable techniques rather than theory. Each concept is paired with hands-on exercises where participants implement patterns they can directly transfer to their own projects. You'll experience firsthand the thrill of processing datasets 10-100x larger than previously possible with standard R. By the tutorial's end, participants will confidently decide which tool fits each analytical challenge and implement scalable workflows that grow with their data needs. The days of "cannot allocate vector of size..." errors will be behind you. All materials, including code examples and datasets, will be available in a GitHub repository, ensuring continued learning beyond the workshop. Join us to transform your data analysis capabilities and remove the memory ceiling that's been holding back your R workflows.Learning goals: Configure and integrate Arrow and DuckDB within an R environment. Translate dplyr workflows to handle out-of-memory datasets. Optimize query performance and implement scalable data processing workflows. Combine R and SQL for complex analytical operations. Target audience: Data scientists, analysts, and researchers who are comfortable with tidyverse packages (particularly dplyr) and are encountering performance limitations when working with larger datasets. The tutorial will benefit professionals across academia, industry, and government sectors who need to scale their R-based data analysis. Prerequisites: None Date and time: Fri, Aug 8, 2025 - 13:00–16:30 Author(s): Elyse Armstrong (Common App), Jeanne McClure (NC State University), Sheila Saia (R-Ladies RTP) Keyword(s): big data, dplyr, apache arrow, duckdb, performance optimization Video recording available after conference: ❌ |
Elyse Armstrong (Common App) Jeanne McClure (NC State University) Sheila Saia (R-Ladies RTP) |
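To make the "keep your dplyr syntax" point above concrete, here is a minimal sketch of the pattern the tutorial describes; the Parquet path and column names are hypothetical, and only collect() pulls results back into R.

```r
library(arrow)
library(dplyr)
library(duckdb)

# Lazily open a directory of Parquet files; nothing is loaded into memory yet
ds <- open_dataset("data/trips_parquet/")

# Familiar dplyr verbs are translated and executed outside of R's memory
ds |>
  filter(passenger_count > 1) |>
  group_by(payment_type) |>
  summarize(mean_fare = mean(fare_amount, na.rm = TRUE)) |>
  collect()

# Hand the same lazy table to DuckDB when its engine suits the job better
ds |>
  to_duckdb() |>
  count(payment_type, sort = TRUE) |>
  collect()
```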
| 13:00–16:30 | TBD | Teaching statistics and data science with R and GitHubMore infoIn this tutorial, participants will learn about teaching R and GitHub in statistics and data science courses. We will discuss pedagogy and curriculum design for effectively teaching computing alongside statistical concepts. Participants will explore example in-class activities and assignments that demonstrate the student experience, while discussing strategies for implementing such activities from the instructor perspective. We will also discuss computing infrastructure options that enable students to use R and RStudio from a web browser with minimal setup. Lastly, we will show how instructors can use R and Quarto to make course materials and streamline their workflow in a reproducible way using GitHub. The tutorial will focus on teaching introductory-level undergraduate students with no previous computing experience, but the tutorial content is applicable for instructors teaching high school courses and courses throughout the undergraduate statistics and data science curriculum.Learning goals: - Learn pedagogical strategies for teaching R and GitHub in a statistics or data science course - Identify how computing can be integrated alongside statistical concepts in a course curriculum - Experience computing activities and assignments from both the student and instructor perspective - Consider the computing infrastructure that may be the best fit for your student population - Learn how to develop course materials with R and Quarto and develop a reproducible workflow with GitHub Target audience: This workshop is for instructors interested in teaching R in their statistics and data science courses. The workshop will be presented from the perspective of teaching at the undergraduate level; however, the contents of this workshop will also be beneficial to instructors teaching high school statistics and data science. Prerequisites: None Date and time: Fri, Aug 8, 2025 - 13:00–16:30 Author(s): Elijah Meyer (North Carolina State University), Maria Tackett (Duke University) Keyword(s): data science education, pedagogy, quarto, github, webr Video recording available after conference: ❌ |
Elijah Meyer (North Carolina State University) Maria Tackett (Duke University) |
| Poster | |||
| 18:15–19:30 | Gross Hall Energy Hub | Applying GAM and MARS using R to predict daily household electricity load curvesMore infoWe show how we predict daily household electricity load curves using the R statistical language and several packages that implement state-of-the-art techniques. The widespread deployment of smart meters in the residential and tertiary sectors has made it possible to collect high-frequency electricity consumption data at the consumer level (individuals, professionals, etc.). This data is a raw material for research on the prediction of electricity consumption at this level. The majority of this research is largely aimed at meeting the needs of industry, such as applications in the context of smart homes and programs for managing and reducing consumption. The objective of this work is to deploy or implement short-term (D + 1) electrical load forecasting models at the consumer level. The complexity of the subject lies in the fact that consumption data on this scale is very volatile. Indeed, it includes a large amount of noise and depends on the consumer's lifestyle and consumption habits. We studied the influence of integrating outdoor temperature in different forms on the performance of a Generalized Additive Model (GAM) and a Multivariate Adaptive Regression Splines (MARS) model. These two models are capable of modelling both linear relationships and non-linear interactions between influencing factors (independent variables), and were adapted to model the temperature sensitivity of load curves. The models were tested and evaluated on a large sample of disparate load curves in the residential sector. An approach was also proposed for the prediction of the most volatile load curves. We will wrap an example dataset as well as the scripts that were used in this work in a package that will be available online.Date and time: Fri, Aug 8, 2025 - 18:15–19:30 Author(s): Frederic Bertrand (Troyes University of Applied Sciences); Fatima Fahs (ES), Myriam Maumy (EHESP) Keyword(s): statistical learning, generalized additive models, multivariate adaptive regression splines, prediction, daily electricity load curves |
Frederic Bertrand (Troyes University of Applied Sciences) |
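The poster above pairs GAM and MARS models with temperature covariates. As a generic, hedged sketch of that kind of fit (not the authors' code, and with simulated data standing in for smart meter records):

```r
library(mgcv)
library(earth)

# Simulated stand-in: load depends on a heating effect below 18 C and a daily cycle
set.seed(1)
df <- data.frame(
  temp = runif(500, -5, 30),
  hour = rep(0:23, length.out = 500)
)
df$load <- 2 + 0.05 * pmax(18 - df$temp, 0) +
  sin(2 * pi * df$hour / 24) + rnorm(500, sd = 0.2)

# GAM with a smooth temperature effect and a cyclic smooth for hour of day
fit_gam  <- gam(load ~ s(temp) + s(hour, bs = "cc"), data = df)

# MARS fit on the same predictors
fit_mars <- earth(load ~ temp + hour, data = df)

predict(fit_gam,  newdata = data.frame(temp = 5, hour = 18))
predict(fit_mars, newdata = data.frame(temp = 5, hour = 18))
```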
| 18:15–19:30 | Gross Hall Energy Hub | Automating GCaMP Fluorescence Analysis for Neuronal Activity Quantification in Stress Response StudiesMore infoCalcium imaging using genetically encoded calcium indicators (GECIs) such as GCaMP provides a critical window into neuronal activity, particularly in response to physiological stress. However, the large volume of imaging data presents challenges in efficient processing, normalization, and statistical analysis. This project introduces an automated R-based workflow that extracts, normalizes, and analyzes fluorescence intensity changes to identify significant neuronal responses. The pipeline standardizes intensity values against baseline fluorescence, filters for neurons exhibiting meaningful activity, and applies statistical modeling—including two-way ANOVA with post-hoc comparisons—to assess differences across experimental conditions. By integrating data processing and statistical analysis, this approach streamlines fluorescence quantification, reducing manual intervention while enhancing reproducibility. The methodology is broadly applicable across neuroscience, bioinformatics, and computational research, providing a scalable solution for analyzing calcium imaging data in diverse experimental settings.Date and time: Fri, Aug 8, 2025 - 18:15–19:30 Author(s): Calvin Cho (Duke University, Trinity College of Arts & Sciences); Carlene Moore (Duke University School of Medicine), Christopher Wickware (Duke University School of Medicine) Keyword(s): automation, statistical modeling, bioinformatics, medical research, data analysis |
Calvin Cho (Duke University Trinity College of Arts & Sciences) |
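A generic sketch of the analysis steps the poster describes (baseline normalization to dF/F0, filtering for responsive neurons, and a two-way ANOVA with post-hoc comparisons); the data here are simulated stand-ins, not the authors' pipeline.

```r
set.seed(2)
d <- expand.grid(
  neuron    = paste0("n", 1:30),
  condition = c("control", "stress"),
  genotype  = c("wt", "mutant")
)
d$baseline <- runif(nrow(d), 100, 200)
d$peak     <- d$baseline * (1 + rnorm(nrow(d), 0.3, 0.1) +
                              0.2 * (d$condition == "stress"))

# Normalize each response to its baseline fluorescence (dF/F0)
d$dff <- (d$peak - d$baseline) / d$baseline

# Keep only neurons with a meaningful response (threshold is an assumption)
d_active <- subset(d, dff > 0.1)

# Two-way ANOVA with Tukey post-hoc comparisons across conditions
fit <- aov(dff ~ condition * genotype, data = d_active)
summary(fit)
TukeyHSD(fit)
```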
| 18:15–19:30 | Gross Hall Energy Hub | Cloud-Based and AI-assisted Workflows in R: Case Studies from North Carolina State University’s Data Science Consulting ProgramMore infoThe Data Science Consulting Program at North Carolina State University supports researchers in analytics, statistics, and data visualization by leveraging cloud-based and AI-assisted workflows that streamline collaboration and eliminate infrastructure challenges. For smaller projects, we use Google Colab, which supports both R and Python, enabling multiple consultants to work together without requiring extensive version control expertise. Since Colab operates entirely in the cloud, patrons can execute workflows seamlessly without the burden of local setup. Additionally, Gemini AI integration provides real-time coding and environment support, reducing time spent troubleshooting and navigating documentation. For larger, scalable projects, we turn to Posit Cloud, a cloud-based IDE for building and deploying interactive dashboards in R and Python. With Shiny Assistant, consultants and researchers receive AI-powered guidance on UI/UX design, backend development, and deployment, ensuring an efficient workflow. To maintain structured collaboration, we manage a modular codebase through a private GitHub repository, allowing for better version control and teamwork. In this poster, we will highlight case studies from our program, illustrating how these cloud-based and AI-assisted workflows enhance research in R by improving collaboration, reducing technical overhead, and increasing efficiency.Date and time: Fri, Aug 8, 2025 - 18:15–19:30 Author(s): Ishti Sikder (North Carolina State University); Shannon Ricci (North Carolina State University), Alp Tezbasaran (North Carolina State University) Keyword(s): collaboration, cloud, generative ai, consulting |
Ishti Sikder (North Carolina State University) |
| 18:15–19:30 | Gross Hall Energy Hub | Leveraging R for Multi-State Modeling in Real-World Oncology ResearchMore infoSurvival analysis is fundamental to oncology research, enabling the estimation of time-to-event outcomes such as overall survival (OS), progression-free survival (PFS), time to treatment discontinuation (TTD), and time to next treatment (TTNT). Multi-state modeling (MSM) extends survival analysis by incorporating dynamic transitions between treatment lines, while accounting for censoring and competing risks. This study demonstrates how R and the mstate package can be used to facilitate the development of complex MSM frameworks for analyzing oncology treatment pathways. This study used the nationwide Flatiron Health electronic health record (EHR)-derived deidentified database. The Flatiron Health database is a longitudinal database, comprising deidentified patient-level structured and unstructured data, curated via technology-enabled abstraction. We implemented a multi-state model in R to track patient transitions from first-line therapy (1L) to subsequent lines (2L, 3L+) and death, incorporating irreversible transitions to reflect real-world treatment pathways. The model adjusts for key baseline clinical covariates, including age, sex assigned at birth, ALK and EGFR biomarker status, and cancer stage, as well as covariates present at the time of state transition including percent change in weight from baseline and most recent serum albumin. The analysis was conducted using R packages survival, ggplot2, tidyverse, dplyr, and mstate, which provide a robust framework for data cleaning, transition probability estimation, patient trajectory visualization, and statistical inference. Non-parametric methods, including the Aalen-Johansen estimator and Kaplan-Meier estimator, were used to estimate transition probabilities, while the semi-parametric Cox proportional hazards model was applied to identify significant clinical factors influencing transitions. This study demonstrates how multi-state modeling techniques can enhance the ability to assess prognosis of time-to-event outcomes in oncology RWE.Date and time: Fri, Aug 8, 2025 - 18:15–19:30 Author(s): Aashay Mahesh Mehta; Spencer Langerman Keyword(s): multi-state modeling, real-world evidence, statistical methods, cox regression, oncology research |
Aashay Mahesh Mehta |
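For readers unfamiliar with mstate, the sketch below shows the general shape of the workflow the poster describes: a transition matrix, long-format preparation, a stratified Cox fit, and Aalen-Johansen transition probabilities. The states, column names, and toy data are assumptions for illustration only, not the Flatiron analysis.

```r
library(mstate)
library(survival)

# States: 1 = 1L therapy, 2 = 2L therapy, 3 = death (irreversible transitions)
tmat <- transMat(
  x = list(c(2, 3), c(3), c()),
  names = c("1L", "2L", "death")
)

# Hypothetical wide-format data: time/status for reaching 2L and for death
df <- data.frame(
  id    = 1:4,
  t2L   = c(5, 8, 3, 10),  s2L   = c(1, 0, 1, 0),
  tdead = c(12, 8, 7, 10), sdead = c(1, 1, 0, 0),
  age   = c(61, 70, 58, 66)
)

# Expand to one row per possible transition
long <- msprep(
  time = c(NA, "t2L", "tdead"), status = c(NA, "s2L", "sdead"),
  data = df, trans = tmat, id = "id", keep = "age"
)

# Transition-specific baseline hazards and Aalen-Johansen transition probabilities
fit <- coxph(Surv(Tstart, Tstop, status) ~ strata(trans), data = long)
pt  <- probtrans(msfit(fit, trans = tmat), predt = 0)
```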
| 18:15–19:30 | Gross Hall Energy Hub | Optical Character Recognition (OCR) Screening in R for PFAS in Project DocumentsMore infoRamboll is frequently retained to analyze large batches of project documents as part of identifying per- and polyfluoroalkyl substances (PFAS) in client operations, such as for compliance with ongoing reporting requirements. Due to the scale of the effort needed to manually search project documents for thousands of terms that may be associated with PFAS (numerous compounds and variations in nomenclature), automation can be very helpful in minimizing human error while reducing costs for our clients. The Optical Character Recognition (OCR) Screening Tool uses R to execute the screening of documents to identify and index CAS numbers, chemical names, trade names, and other PFAS-related keywords. The R tool was built for use within Ramboll for this purpose. This tool prepares the PDFs by organizing them into categories: those that can be read initially, those that need to have OCR applied, and those that will need human review, minimizing upfront effort. A large keyword list was developed by Ramboll’s PFAS subject matter expert team. The keyword list is organized in groups and can be customized based on client needs. It is important to note that the tool can be adapted for any keyword list. The R tool lists specific instances of relevant search terms within the document, along with approximate page numbers, building an index organized by the groupings provided by the user. Once complete, the output of the tool is summarized for use in a Shiny dashboard for easy viewing. Excel reports are also generated with the files returning results being hyperlinked automatically. A triage task is often implemented to screen the output for false positives and for identifications that require further action.Date and time: Fri, Aug 8, 2025 - 18:15–19:30 Author(s): Bruce Franz (Ramboll); Brian Drollette (Ramboll), Jon Hunt (Ramboll) Keyword(s): ocr, pfas, shiny, environmental science, health sciences |
Bruce Franz (Ramboll) |
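The poster describes preparing PDFs, applying OCR only where needed, and indexing keyword hits by page. Below is a generic sketch of that idea using pdftools (not Ramboll's tool); the file path and keyword list are placeholders, and pdf_ocr_text() requires the tesseract package.

```r
library(pdftools)

screen_pdf <- function(path, keywords) {
  pages <- pdf_text(path)
  # Pages with no extractable text are likely scanned and need OCR
  if (all(!nzchar(trimws(pages)))) pages <- pdf_ocr_text(path)

  # Record which keywords appear on which page
  hits <- lapply(seq_along(pages), function(i) {
    found <- keywords[vapply(keywords, grepl, logical(1),
                             x = pages[i], ignore.case = TRUE)]
    if (length(found)) data.frame(page = i, keyword = found) else NULL
  })
  do.call(rbind, hits)
}

# Hypothetical usage with a small PFAS keyword list
# screen_pdf("docs/report_001.pdf", c("PFOA", "PFOS", "335-67-1", "GenX"))
```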
| 18:15–19:30 | Gross Hall Energy Hub | Partially automated driving, EEG signals, and eye-tracking data: Using R to consolidate multiple subject files for machine learning modelsMore infoIn this multimodal driving simulation study, which explored the relationship between drivers’ attention and conversational prompts to enhance performance in partially automated cars, each participant produced five separate eye-tracking datasets and a Muse Headset EEG dataset for each of four separate driving scenarios with differing cognitive workloads. The present work highlights how R was utilized to consolidate these datasets to be usable with machine learning packages, including caret, randomForest, and gbm. Critically, generating an initial dataset containing the names of the available CSV files allowed this process to be iterated across all files in the working directory. All eye-tracking data for a participant in a given scenario were read into a pooled dataset to determine if any screen had a positive indicator variable value that the participant was looking at the proper screen on a second-by-second basis. This second-by-second consolidated eye-tracking dataset would then be combined with the second-by-second EEG dataset for the participant for that scenario. Once all consolidated datasets had been made for all participants’ scenarios, they were then stacked and prepared for analysis. Utilizing an 80/20 split for the training/test paradigm, five EEG signals (Alpha, Beta, Gamma, Delta, and Theta) from four channels (AF7, AF8, TP9, and TP10) were used to predict whether the participant was looking at the proper screen with around 85% accuracy. Implications of attentional considerations reflected by this data in the interest of driver safety will be discussed.Date and time: Fri, Aug 8, 2025 - 18:15–19:30 Author(s): Jesse DeLaRosa (Duke Clinical Research Institute); Xiaolu Bai (North Carolina State University), Jing Feng (North Carolina State University) Keyword(s): data wrangling, statistical learning, eeg, eye-tracking, self-driving cars |
Jesse DeLaRosa (Duke Clinical Research Institute) |
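A generic sketch of the consolidation step described above (not the authors' code): enumerate every CSV in a hypothetical directory, read and stack them, and collapse eye-tracking records to one row per participant, scenario, and second. All paths and column names are assumptions.

```r
library(dplyr)
library(purrr)
library(readr)

# Hypothetical working directory of per-participant, per-scenario exports
files <- list.files("data/eyetracking", pattern = "\\.csv$", full.names = TRUE)

# Read and stack every file, keeping track of which file each row came from
eye <- map_dfr(files, ~ read_csv(.x, show_col_types = FALSE), .id = "file_id")

# Collapse to one row per participant/scenario/second: was any on-screen
# indicator positive during that second? (column names are assumptions)
eye_sec <- eye %>%
  group_by(participant, scenario, second) %>%
  summarize(on_target = as.integer(any(on_screen == 1)), .groups = "drop")
```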
| 18:15–19:30 | Gross Hall Energy Hub | R-Ladies Global: Promoting Diversity and Inclusion in the R Community for Nearly a DecadeMore infoR-Ladies Global is a worldwide organization focused on achieving proportionate representation of genders currently underrepresented in the R programming community. To meet this goal, we support a network of local chapters who organize events that encourage, inspire, and empower individuals to meet their programming potential. Since R-Ladies Global was founded in 2016, it has grown to provide training and mentoring to over 100,000 members, in 244 chapters, and 63 countries. Local chapters have held over 4,200 events focused on a wide range of topics from workshops on popular R programming libraries (e.g., data.table, Shiny) and R package development to data science panels on various topics (e.g., Ethics in Data Science, Women in Tech) to hackathons (e.g., #TidyTuesday) to speaking opportunities (e.g., Lightning Talks) to networking events (e.g., dinner meet-ups, book clubs). The organization also maintains a directory of speakers, an abstract review system for conferences, and a YouTube channel with recordings of events, among other valuable resources. There are many leadership, training, career development, and mentoring opportunities for professionals who join R-Ladies Global. Please stop by our poster or visit our website (https://rladies.org/about-us) to learn more about how you can get involved and/or support our mission.Date and time: Fri, Aug 8, 2025 - 18:15–19:30 Author(s): Sheila Saia (R-Ladies RTP); R-Ladies Global Team (R-Ladies Global) Keyword(s): r-ladies, coding, community, diversity |
Sheila Saia (R-Ladies RTP) |
| 18:15–19:30 | Gross Hall Energy Hub | Reproducible Research at Scale using RMore infoReproducibility and collaboration are essential for scalable scientific research. In this presentation, we outline a workflow that integrates open-source tools with Posit’s proprietary products to enable and streamline reproducible research. Our approach leverages Backstage for automating project setup via template workflows, GitLab for version control and code review, and renv and Posit Package Manager for R environment management (internally- and externally-developed packages). Putting it all together, analytic outputs are created via Quarto in Posit Workbench and then shared with our clinical experts via Posit Connect. These templated workflows and tools allow Flatiron Health scientists to consistently produce high-quality analytic reports that can easily be reproduced and distributed across the organization. Our templates further ensure that developer setup is streamlined and output is styled consistently across projects and teams. By integrating industry-leading, open source tools, we create a robust, scalable workflow that embeds reproducibility and enhances collaboration across research teams.Date and time: Fri, Aug 8, 2025 - 18:15–19:30 Author(s): Nicole Nasrallah (Flatiron Health), Benjamin Wagner (Flatiron Health); Michael Thomson (Flatiron Health), Erica Yim (Flatiron Health) Keyword(s): reproducibility, workflow, research |
Nicole Nasrallah (Flatiron Health) Benjamin Wagner (Flatiron Health) |
| 18:15–19:30 | Gross Hall Energy Hub | Safety first: Design-informed inference for treatment effects via the propertee package for RMore infoWhen treatments are allocated by cluster, it is vital for correct inference that the clustering structure be tracked and appropriately attended to. In randomized trials and observational studies modeled on RCTs, clustering is determined at the early stage of study design, with subtle but important implications for the later stage of treatment effect estimation. A first contribution of our "propertee" R package is to make analysis safer by providing self-standing functions to record treatment allocations, with the thus-encoded study design informing subsequent calculations of inverse probability weights, if requested, and of standard errors. A second contribution is to facilitate the use of precision-enhancing predictions from models fitted to external or partly external samples. The user experience is kept simple by adapting such familiar R mechanisms as predict(), lm(), offset(), the sandwich package and summary(); under the hood it stacks estimating equations for sandwich estimates of variance. The propertee package makes it easy and safe to produce Hajek- or block fixed effect estimates with appropriate standard errors, even in the presence of grouped assignment to treatment, repeated measures, subgroup-level estimation and/or covariance adjustment.Date and time: Fri, Aug 8, 2025 - 18:15–19:30 Author(s): Ben Hansen (University of Michigan); Adam Sales (Worcester Polytechnic Institute), Xinhe Wang (University of Michigan) Keyword(s): causal inference, conditional growth status model, design-based, direct adjustment, workflow |
Ben Hansen (University of Michigan) |
| 18:15–19:30 | Gross Hall Energy Hub | Scaling R Support in an Academic Library: Data-Driven Insights from North Carolina State University’s Data Science Consulting ProgramMore infoNorth Carolina State University’s Data Science Consulting Program, housed within the NC State University Libraries, has seen a steady increase in R-related research requests over the past few years. This poster showcases how our team systematically tracks and analyzes consultation data to identify emerging trends, refine service offerings, and guide staffing decisions. We illustrate key patterns—such as the steady increase in demand for tidyverse-based data wrangling, reproducible workflows with R Markdown, and interactive Shiny applications—and detail how these insights inform our tailored workshops and one-on-one support. In addition to quantitative metrics, we highlight the collaborative and inclusive environment that underpins our consulting approach. From assisting novice users with their very first script to supporting advanced modeling for cross-disciplinary projects, our goal is to maintain a no-judgment, solution-oriented culture that empowers researchers at all skill levels. By sharing anonymized case studies and lessons learned, we demonstrate how a blend of data-driven planning and human-centered consulting can help institutions efficiently scale R support services.Date and time: Fri, Aug 8, 2025 - 18:15–19:30 Author(s): Claire Murphy (North Carolina State University); Franziska Bickel (North Carolina State University), Alp Tezbasaran (North Carolina State University), Selene Schmittling (North Carolina State University), Shannon Ricci (North Carolina State University) Keyword(s): research, data-driven analysis, consulting, r support |
Claire Murphy (North Carolina State University) |
| 18:15–19:30 | Gross Hall Energy Hub | The Workplace Wellbeing Assessment: Using R to evaluate organizational factors impacting mental health and wellbeing of international aid workersMore infoThe project presented here details the development of a diagnostic instrument, the Workplace Wellbeing Assessment (WWA), to help international aid organizations assess organizational structures that impact employee wellbeing. Background Humanitarian and development organizations (aid organizations) operate in environments that expose employees to unique stressors, including armed conflict, natural disasters, and working with traumatized clients. However, internal organizational factors have been shown to have a similarly large impact on employee mental health and wellbeing as the aforementioned acute stressors. These factors include unsupportive policies, unclear role expectations, poor work-life balance, inadequate compensation, and a lack of mental health resources. There is an urgent need for a practical tool to 1) help organizations evaluate how their policies and practices affect employee mental health, and 2) provide evidence-based recommendations for amending these policies and practices. The instrument The WWA consists of three distinct parts: 1. The questionnaire, built in Qualtrics and based on the [Workplace Mental Health & Well-Being framework][1], allows employees to provide input on key organizational structures, policies, and cultural aspects that influence employee wellbeing. 2. The assessment tool uses the qualtRics R package to retrieve the survey data via the Qualtrics API. It then decodes the questionnaire responses and provides a score for each of the five "Essentials" of the framework as well as their individual components, offering an objective evaluation of the organization’s strengths and weaknesses with regard to mental health and wellbeing. Finally, it uses OpenAI API to summarize the survey's text responses, providing organizations with anonymized qualitative data. 3. The recommendation tool consists of a Shiny app that acts as a data dashboard for organizations, providing organizations with key charts, tables, and insights. Furthermore, for components with low scores, the dashboard provides organizations with tailored, actionable, evidence-based guidance (written by the author) on how to improve these problem areas. While the codebase is relatively simple, this project serves as an example for public health practitioners that R is a versatile tool that can be used for more than biostatistics and academic research - in this case to drive employee mental health and wellness improvements for international aid organizations. [1]: https://www.hhs.gov/sites/default/files/workplace-mental-health-well-being.pdfDate and time: Fri, Aug 8, 2025 - 18:15–19:30 Author(s): Julius Torres Kellinghusen (New York University - School of Global Public Health) Keyword(s): mental health, humanitarian aid, survey analysis, shiny, ai |
Julius Torres Kellinghusen (New York University - School of Global Public Health) |
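The assessment tool described above pulls responses via the Qualtrics API. A minimal sketch of that retrieval step with the qualtRics package follows; the credentials, survey ID, and item prefixes are placeholders, and the toy scoring line only stands in for the WWA's actual scoring logic.

```r
library(qualtRics)
library(dplyr)

# Register API credentials (values here are placeholders)
qualtrics_api_credentials(
  api_key  = Sys.getenv("QUALTRICS_API_KEY"),
  base_url = "yourorg.qualtrics.com"
)

# Pull responses for a hypothetical survey ID
responses <- fetch_survey(surveyID = "SV_hypothetical123", verbose = FALSE)

# Toy scoring: average the Likert items assumed to share a "Q1_" prefix
responses %>%
  summarize(across(starts_with("Q1_"), ~ mean(.x, na.rm = TRUE)))
```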
| 18:15–19:30 | Gross Hall Energy Hub | Using R and explainable machine learning to estimate green value of householdMore infoWe show how we estimate the green value of housing by focusing on energy performance labels, in order to understand how housing prices evolve when energy performance improves, using the R statistical language and several packages that implement state-of-the-art techniques. Instead of fitting a hedonic model, a special kind of linear model, as was done in previous work, we fit random forests or XGBoost models. Unlike linear models, which directly reveal the relative importance of the variables via coefficients, these complex models require alternative methods to quantify the impact of the input variables. Shapley values are often used to tackle this issue for random forests and XGBoost models, which do not provide explicit coefficients. Their calculation guarantees that each feature is fairly represented, taking into account all possible combinations of variables. However, with non-linear and complex models such as random forests and XGBoost, the exact calculation of Shapley values becomes computationally prohibitive. As a consequence, we used more efficient approximation methods such as SHAP, KernelSHAP and FastSHAP to interpret the predictions given by models and we managed to propose an estimate of the “green value” of housing. We will wrap an example dataset as well as the scripts that were used in this work in a package that will be available online.Date and time: Fri, Aug 8, 2025 - 18:15–19:30 Author(s): Myriam Maumy (EHESP); Frederic Bertrand (Troyes University of Applied Sciences), Elizaveta Logosha Keyword(s): machine learning, explainable ai, random forests, shapley values, green value |
Myriam Maumy (EHESP) |
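To illustrate the SHAP-style approximation mentioned above (not the authors' analysis), here is a minimal sketch with ranger and fastshap on simulated housing-like data; the variables, including the energy performance label, are invented.

```r
library(ranger)
library(fastshap)

# Simulated stand-in for housing data: price depends on area and energy label
set.seed(42)
df <- data.frame(
  area  = runif(200, 30, 150),
  label = sample(1:7, 200, replace = TRUE)   # energy performance label, A = 1 ... G = 7
)
df$price <- 200 + 1.5 * df$area - 8 * df$label + rnorm(200, sd = 20)

fit <- ranger(price ~ area + label, data = df)

# fastshap needs a prediction wrapper and the feature matrix
pfun <- function(object, newdata) predict(object, data = newdata)$predictions
X    <- df[, c("area", "label")]

shap <- fastshap::explain(fit, X = X, pred_wrapper = pfun, nsim = 20)
colMeans(abs(shap))   # rough per-feature contribution, including the energy label
```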
| 18:15–19:30 | Gross Hall Energy Hub | Utilizing R and Terra to describe the geographic distribution of patients with Early and Advanced Non-Small Cell Lung CancerMore infoIn this case study, we demonstrate the usage of the terra package to create a choropleth map describing the distribution of patients with early and advanced stage Non-Small Cell Lung Cancer (NSCLC) within the continental United States using data from the nationwide Flatiron Health electronic health record (EHR)-derived deidentified database. The study included adults aged ≥18 years who were diagnosed with NSCLC between January 2011 and December 2023. US 3-digit ZIP code area boundaries derived from the US Census Bureau’s ZIP Code Tabulation Areas were processed in R using the terra package and then linked to de-identified patient addresses. Within each ZIP3 boundary, the number of patients with early or advanced stage at diagnosis was summarized within 8 levels: less than 20, 21-50, 51-75, 76-100, 101-500, 501-1000, 1001-2000, and 2001 or greater. This visualization approach provides a clear representation of the geographic availability of data within the clinical data source and can be used to inform targeted clinical trial enrollment efforts or assess geographic representativeness of observational real-world studies.Date and time: Fri, Aug 8, 2025 - 18:15–19:30 Author(s): Yunzhi Qian; Spencer Langerman (Flatiron Health) Keyword(s): gis, real-world data, data visualization, oncology |
Yunzhi Qian |
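As a toy illustration of the terra workflow above (not the Flatiron analysis), the sketch below bins hypothetical counts onto the example polygons shipped with terra and draws a choropleth; real ZIP3/ZCTA boundaries and patient counts would take their place.

```r
library(terra)

# Toy polygons standing in for ZIP3 areas
v <- vect(system.file("ex/lux.shp", package = "terra"))

# Hypothetical patient counts joined onto the polygons, then binned
set.seed(7)
v$n_patients <- sample(5:2500, nrow(v), replace = TRUE)
v$bin <- cut(
  v$n_patients,
  breaks = c(0, 20, 50, 75, 100, 500, 1000, 2000, Inf),
  labels = c("<=20", "21-50", "51-75", "76-100", "101-500",
             "501-1000", "1001-2000", ">2000")
)

# Choropleth coloured by the binned count
plot(v, "bin", main = "Patients per area (toy data)")
```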
| 18:15–19:30 | Gross Hall Energy Hub | rPlaywright: Bringing Playwright’s Power to R for Scalable Web AutomationMore infoBrowser automation is critical for web scraping, automated testing, and workflow optimisation, yet existing R solutions often struggle with dynamic, JavaScript-intensive websites. Traditional approaches commonly rely on static HTML parsing or external browser drivers, which tend to be slow, brittle, and difficult to scale. To address these challenges, this project introduces rPlaywright, a novel R wrapper for the modern browser automation library, Playwright. rPlaywright empowers R users with robust browser automation capabilities across Chromium, Firefox, and WebKit, enabling effortless interaction with dynamic web content through built-in auto-waiting, infinite scrolling, and headless browser interactions, all within a seamless R workflow. Creating rPlaywright required translating Playwright’s asynchronous JavaScript-based API into R’s synchronous programming paradigm, a non-trivial challenge. This involved establishing an API bridge using a lightweight Fastify server to mediate communication between R and Node.js, managing JavaScript promises from within R, and developing an intuitive R6-based interface that preserves the design flexibility and power of Playwright’s original API, yet is tailored specifically for R users. Initiated through my participation in the rOpenSci Champions Program, this project represents a significant step towards expanding modern web automation capabilities within the R ecosystem. It provides R users—whether web scrapers, automated testers, or data analysts—with a scalable, efficient, and user-friendly toolkit for working effectively with modern, interactive web environments. This poster will showcase rPlaywright, discuss key technical considerations during its development, and demonstrate practical integrations with popular R packages for streamlined data extraction and analysis.Date and time: Fri, Aug 8, 2025 - 18:15–19:30 Author(s): Erika Siregar (University of Sheffield, R-Ladies Jakarta) Keyword(s): browser automation, web scraping, r package development, playwright, dynamic web data, ropensci |
Erika Siregar (University of Sheffield R-Ladies Jakarta) |
| Day 2: Saturday, August 9, 2025 | |||
| Room | Title, abstract, and more info | Presenter(s) | |
|---|---|---|---|
| Data visualization | |||
| 10:30–12:00 | Penn 1 | From #EconTwitter to the White House: Real-Time Economic Data with RMore infoIt's not just financial markets. Policy and economics reporters, commentators, and public officials all use real-time analysis of leading economic data as soon as it is available. Moments after the release of jobs numbers, inflation rates, or GDP data, policymakers, journalists, and commentators dive into real-time interpretation and visualization. In this high-speed environment, the right tools are essential, and R stands out as particularly powerful. Join Mike Konczal as he shares his firsthand experiences using R in real-time following data releases to create viral graphics on #EconTwitter, prepare quotes for reporters and materials for media appearances, and even coordinate analysis at the White House, where he served covering economic data for the National Economic Council. You'll learn the process, from how to access and manipulate government economic data to making your own economic work clear and accessible to the broader public.Date and time: Sat, Aug 9, 2025 - 10:30–12:00 Author(s): Mike Konczal (Economic Security Project) Keyword(s): economics, politics, finance, macroeconomics, public communications Video recording available after conference: ✅ |
Mike Konczal (Economic Security Project) |
| 10:30–12:00 | Penn 1 | Visualising Uncertainty with ggdibblerMore infoAdding uncertainty representation in a data visualisation can help in decision-making. There is an existing wealth of software designed to visualise uncertainty as a distribution or probability. These visualisations are excellent for helping understand the uncertainty in our data, but they may not be effective at incorporating uncertainty to prevent false conclusions. Successfully preventing false conclusions requires us to communicate the estimate and its error as a single “validity of signal” variable, and doing so proves to be difficult with current methods. In this talk, we introduce ggdibbler, a ggplot extension that makes it easier to visualise uncertainty in plots for the purposes of preventing these “false signals”. We illustrate how ggdibbler can be seamlessly integrated into existing visualisation workflows and highlight the effect of these changes by showing the alternative visualisations ggdibbler produces for a choropleth map.Date and time: Sat, Aug 9, 2025 - 10:30–12:00 Author(s): Harriet Mason (Monash University); Dianne Cook (Monash University, Australia), Sarah Goodwin (Monash University), Susan Vanderplas (University of Nebraska - Lincoln) Keyword(s): uncertainty, data visualisation, ggplot, r package Video recording available after conference: ✅ |
Harriet Mason (Monash University) |
| 10:30–12:00 | Penn 1 | Visualizing time with ggtime's grammar of temporal graphicsMore infoWhile several commonly used plots exist for visualizing time series, little work has been done to formalize them into a unified grammar of temporal graphics. Re-expressing traditional time series graphics such as time plots and seasonal plots with grammatical elements supports deeper customization options. Composable grammatical elements provide the flexibility needed to easily visualize multiple seasonality, cycles, and other complex temporal patterns. These modular elements can be composed together to create familiar time series graphics, and also recombined to create new informative plots. The ggtime package extends the ggplot2 ecosystem with new grammar elements and plot helpers for visualising time series data. These additions leverage calendar structures to visually align time points across different granularities and timezones, warp time to standardize irregular durations, and wrap time into compact calendar layouts. In this talk, I will introduce ggtime and demonstrate how its grammar of temporal graphics enables a flexible visualization of time series patterns.Date and time: Sat, Aug 9, 2025 - 10:30–12:00 Author(s): Mitchell O'Hara-Wild (Monash University); Cynthia Huang (Monash University) Keyword(s): grammar of graphics, time series, calendars, package design, ggplot2 extension Video recording available after conference: ✅ |
Mitchell O'Hara-Wild (Monash University) |
| 10:30–12:00 | Penn 1 | tinyplot: convenient and customizable base R plotsMore infoThe {[tinyplot][1]} package provides a lightweight extension of the base R graphics system. It aims to pair the concise syntax and flexibility of base R plotting, with the convenience features pioneered by newer ({grid}-based) visualization packages like {ggplot2} and {lattice}. This includes the ability to plot grouped data with automatic legends and/or facets, advanced visualization types, and easy customization via ready-made themes. This talk will provide an introduction to {tinyplot} in the form of various plotting examples, describe its motivating use-cases, and also contrast its advantages (and disadvantages) compared to other R visualization libraries. The package is available on CRAN. [1]: https://grantmcdermott.com/tinyplot/Date and time: Sat, Aug 9, 2025 - 10:30–12:00 Author(s): Grant McDermott (Amazon); Vincent Arel-Bundock (Université de Montréal), Achim Zeileis (Universität Innsbruck) Keyword(s): data viz, base graphics Video recording available after conference: ✅ |
Grant McDermott (Amazon) |
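As a quick taste of the syntax the talk covers, here is a minimal grouped scatterplot with an automatic legend. It follows the package's documented formula interface, though the exact arguments shown here are from memory and worth checking against the tinyplot documentation.

```r
library(tinyplot)

# Grouped scatterplot with an automatic legend, base-graphics style
tinyplot(Sepal.Length ~ Petal.Length | Species, data = iris)

# The same call faceted by group instead of coloured within one panel
tinyplot(Sepal.Length ~ Petal.Length | Species, data = iris, facet = "by")
```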
| Modeling 1 | |||
| 10:30–12:00 | Penn 2 | Adding new algorithms to {tidyclust}More infoThe {tidyclust} package, released in 2022, brings unsupervised learning to the {tidymodels} framework. This talk will share an overview of the process by which new models and algorithms are added to the {tidyclust} collection, based on recent work adding five new models for clustering and data mining (DBSCAN, GMM, BIRCH, itemset mining, and association rules). We will discuss in-depth the complications - programmatic, algorithmic, and philosophical - of adapting a supervised learning framework to unsupervised and semi-supervised settings. For example, what does it mean to tune a parameter in the absence of validating prediction metrics? How should row-based clustering be processed differently than column-based clustering? This talk is aimed at R users and developers who want to think deeply about the intersection between code design choices and methodological principles in unsupervised learning, and who want to peek behind the curtain of the {tidyclust} package framework.Date and time: Sat, Aug 9, 2025 - 10:30–12:00 Author(s): Kelly Bodwin (California Polytechnic State University) Keyword(s): tidymodels, tidyclust, unsupervised learning, clustering, package development Video recording available after conference: ✅ |
Kelly Bodwin (California Polytechnic State University) |
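For context on the framework the talk extends, here is a minimal sketch of the existing tidyclust interface with k-means; the new models discussed in the talk are expected to follow the same specification-fit-predict pattern, but the code below is illustrative, not taken from the talk.

```r
library(tidyclust)

# Specify the clustering model and its engine
spec <- k_means(num_clusters = 3) |>
  set_engine("stats")

# Fit on all columns of a built-in dataset
fit <- fit(spec, ~ ., data = mtcars)

# Inspect assignments and assign new observations to clusters
extract_cluster_assignment(fit)
predict(fit, new_data = mtcars[1:3, ])
```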
| 10:30–12:00 | Penn 2 | Modeling Eviction Trends in VirginiaMore infoVirginia is home to 5 of the top 10 cities in the country with the highest rates of eviction. Using civil court records, we are able to analyze the behavior of landlords, so that we can hold those in power accountable to make effective and just change. Where do landlords engage in more eviction actions? What characteristics of renters or landlords increase the practice of serial filing? Using administrative data -- information collected by government and agencies in the implementation of public programs -- we are able to evaluate systems and promote more just outcomes. Working with the Civil Court Data Initiative of Legal Services Corporation, we use data collected from civil court records in Virginia to analyze the behavior of landlords. Expanding on our Virginia Evictors Catalog, we use data on court evictions to build additional data tools to support the work of legal and housing advocates and model key eviction outcomes to contribute to our understanding of landlord behavior. First, we visualized eviction activity across the state in an interactive Shiny app to address questions and needs of organizations providing legal, policy, and community advocacy. In addition, we estimated landlord actions – eviction filings and serial filings – as a function of community and landlord characteristics. Using a series of mixed-effects models, with data aggregated to zipcodes nested in counties, we estimated the impact of community characteristics and landlord attributes on the likelihood of eviction filings. Participants will walk away with a better understanding of what influences landlord behavior, and will have a framework for investigating the practice in their own communities.Date and time: Sat, Aug 9, 2025 - 10:30–12:00 Author(s): Michele Claibourn (Center for Community Partnerships), Samantha Toet (Center for Community Partnerships) Keyword(s): shiny, data visualization, mixed-effects modeling, geography, social science Video recording available after conference: ✅ |
Michele Claibourn (Center for Community Partnerships) Samantha Toet (Center for Community Partnerships) |
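A minimal sketch of the kind of mixed-effects specification described above (ZIP codes nested in counties), using lme4 on simulated data; the covariates, counts, and offset are invented and the model is illustrative only.

```r
library(lme4)

# Simulated stand-in: 10 counties, 5 ZIP codes each
set.seed(3)
evict_zip <- data.frame(
  county      = rep(paste0("county", 1:10), each = 5),
  zip         = paste0("zip", 1:50),
  renters     = rpois(50, 800),
  pct_poverty = runif(50, 5, 35)
)
evict_zip$filings <- rpois(
  50, lambda = evict_zip$renters * 0.02 * (1 + evict_zip$pct_poverty / 50)
)

# Poisson mixed model: filings per renting household, ZIPs nested in counties
m <- glmer(
  filings ~ pct_poverty + offset(log(renters)) + (1 | county / zip),
  family = poisson, data = evict_zip
)
summary(m)
```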
| 10:30–12:00 | Penn 2 | Predictive Modeling with Missing DataMore infoMost predictive modeling strategies require there to be no missing data for model estimation. When there is missing data, there are generally two strategies for working with missing data: 1.) exclude the variables (columns) or observations (rows) where there is missing data; or 2.) impute the missing data. However, data is often missing in systematic ways. Excluding data from training is ignoring potentially predictive information and for many imputation procedures the missing completely at random (MCAR) assumption is violated. The medley package implements a solution to modeling when there are systematic patterns of missingness. A working example of predicting student retention from a larger study of the Diagnostic Assessment and Achievement of College Skills (DAACS) will be explored. In this study, demographic data was collected at enrollment from all students and then students completed diagnostic assessments in self-regulated learning (SRL), writing, mathematics, and reading during their first few weeks of the semester. Although all students were expected to complete DAACS, there were no consequences and therefore a large percentage of students completed none or only some of the assessments. The resulting dataset has three predominant response patterns: 1.) students who completed all four assessments, 2.) students who completed only the SRL assessment, and 3.) students who did not complete any of the assessments. The goal of the medley algorithm is to take advantage of missing data patterns. For this example, the medley algorithm trained three predictive models: 1.) demographics plus all four assessments, 2.) demographics plus SRL assessment, and 3.) demographics only. For both training and prediction, the model used for each student is based upon what data is available. That is, if a student only completed SRL, model 2 would be used. The medley algorithm can be used with most statistical models. For this study, both logistic regression and random forest are used. The accuracy of the medley algorithm was 3.5% better than using only the complete data and 3.1% better than using a dataset where missing data was imputed using the mice package. The medley package provides an approach for predictive modeling using the same training and prediction framework R users are accustomed to using. There are numerous parameters that can be modified including what underlying statistical models are used for training. Additional diagnostic functions are available to explore missing data patterns.Date and time: Sat, Aug 9, 2025 - 10:30–12:00 Author(s): Jason Bryer (City University of New York) Keyword(s): predictive modeling, r package Video recording available after conference: ✅ |
Jason Bryer (City University of New York) |
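The medley algorithm itself lives in the medley package; the sketch below is not its API but a bare-bones illustration of the pattern-based idea the abstract describes: fit one model per missingness pattern and score each student with the model that matches the data they actually have.

```r
# Simulated retention data with the three response patterns from the abstract
set.seed(1)
n <- 300
d <- data.frame(
  retained = rbinom(n, 1, 0.7),
  age      = rnorm(n, 25, 6),
  srl      = rnorm(n),
  math     = rnorm(n)
)
d$math[101:300] <- NA   # patterns 2 and 3: no math assessment
d$srl[201:300]  <- NA   # pattern 3: demographics only

# One logistic regression per missingness pattern
models <- list(
  full  = glm(retained ~ age + srl + math, data = d,
              subset = !is.na(math), family = binomial),
  srl   = glm(retained ~ age + srl, data = d,
              subset = is.na(math) & !is.na(srl), family = binomial),
  demog = glm(retained ~ age, data = d,
              subset = is.na(srl), family = binomial)
)

# At prediction time, pick the model that matches the available data
predict_by_pattern <- function(row) {
  if (!is.na(row$math))     predict(models$full,  row, type = "response")
  else if (!is.na(row$srl)) predict(models$srl,   row, type = "response")
  else                      predict(models$demog, row, type = "response")
}
predict_by_pattern(d[1, ])
```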
| 10:30–12:00 | Penn 2 | jarbes: an R package for Bayesian parametric and nonparametric bias correction in meta-analysisMore infoMeta-analysis methods help researchers answer questions that require combining statistical results across several studies. Very often, the only available studies are of different types and of varied quality. Therefore, when we combine disparate evidence at face value, we are not only combining results of interest but also potential biases that might threaten the quality of the results. Consequently, the results of the meta-analysis could be misleading. This work presents the R package jarbes, “Just a rather Bayesian Evidence synthesis.” This package has been designed explicitly for Bayesian evidence synthesis and meta-analysis. It implements a family of Bayesian parametric and nonparametric models for meta-analysis that account for multiple biases. A model in jarbes is built upon two submodels: one that contains the parameters of interest (e.g., a pooled mean across studies) and another that accounts for biases. The biases submodel addresses hidden factors that may distort study results (e.g., selection bias, dilution bias, reporting bias) and are not directly observable. This model-building strategy allows the model of bias to correct the meta-analysis affected by biased evidence. We present two real examples of applying the Bayesian nonparametric modeling functionality of jarbes. The first combines studies of different types and quality, and the second shows the effect of bias correction in nonparametric meta-regression. References Verde, P. E. (2024), “jarbes: An R Package for Bayesian Evidence Synthesis.” Version 2.2.3. https://CRAN.R-project.org/package=jarbes Verde, P. E. and Rosner, G. L. (2025), A Bias-Corrected Bayesian Nonparametric Model for Combining Studies With Varying Quality in Meta-Analysis. Biometrical Journal., 67: e70034. https://doi.org/10.1002/bimj.70034Date and time: Sat, Aug 9, 2025 - 10:30–12:00 Author(s): Pablo Verde (University of Dusseldorf) Keyword(s): meta-analysis, bayesian nonparametrics, bias-correction, evidence synthesis Video recording available after conference: ✅ |
Pablo Verde (University of Dusseldorf) |
| Case studies | |||
| 10:30–12:00 | Penn Garden | From Copy-Paste Chaos to Reproducible Workflows: A Wet Lab Researcher’s Journey into RMore infoAs a wet lab researcher, I used to struggle with fragmented data analysis workflows. I was taught: You do your experiments, you get your data, you copy-paste into separate software packages for descriptive statistics, visualisation, and documentation. I was constantly frustrated with data analysis: Change something early in the analysis? Go back and copy-paste. How did I analyse similar data sets previously while working at a different institute? Good luck opening that proprietary file format without that software and the license. Learning R transformed how I approach data, not just by replacing individual tools but reshaping my entire understanding of analysis. Beyond statistics, R introduced me to better data organisation, reproducible analysis, meaningful visualisation, and a community dedicated to improving data analysis and reporting. Working with R taught me more than any course on data analysis ever did. Now I use RMarkdown and Quarto daily to document and report my research. These tools allow me to standardise workflows, making my analyses reproducible and independent of proprietary software that might not be available in all research settings. Beyond improving my own work, these tools have become invaluable for guiding students, e.g. providing example workflows for common assays, and visualisations to help them better understand their data. In my talk, I will share my journey from chaotic spreadsheets to a reproducible, streamlined workflow. I will showcase the specific tools I use and how they have improved my research. Lastly, I will invite other wet lab researchers to discuss how these tools can help address reproducibility challenges in data analyses.Date and time: Sat, Aug 9, 2025 - 10:30–12:00 Author(s): Anna Jaeschke Keyword(s): wet lab research, workflow, experimental research Video recording available after conference: ✅ |
Anna Jaeschke |
| 10:30–12:00 | Penn Garden | Readability: New ways to improve communication at the Central Bank of ChileMore infoThis study presents the development of a Shiny application, created entirely within the Central Bank of Chile, to improve the readability of its monetary policy communications. Effective communication is essential for central banks, as it influences expectations and decision-making. However, technical language and complex sentence structures often hinder comprehension. Initially, readability was assessed using the perspicuity index, an adaptation of the Flesch-Kincaid index. However, this method does not identify the specific sources of difficulty, especially in Spanish. To address this, a new theoretical framework was developed, identifying five key complexity dimensions: (1) nominalization, (2) gerunds, (3) depth of dependency, (4) subordinations, and (5) language complexity. Using Natural Language Processing (NLP), the Shiny application detects readability challenges by: 1. Calculating the percentage of sentences with readability issues. 2. Highlighting complex structures within the text. 3. Providing sentence-level breakdowns of readability difficulties. 4. Comparing language complexity against graded dictionaries. Applying this tool to monetary and financial policy reports since 2018 revealed that approximately 30% of the content contains readability challenges. The monetary policy summaries correlate strongly with the perspicuity index, indicating that most readability issues stem from syntactic complexity. In contrast, financial policy summaries show lower correlation, as their difficulty arises from long words and technical terms. Since its first use in December 2022, the application has played a key role in reducing text complexity in official reports. However, an increase in complexity in June 2023, following a change in report authorship, underscores the importance of user adoption in ensuring consistent readability improvements. Ultimately, this initiative highlights the need for tailored readability strategies across different policy instruments. While monetary policy documents benefit from structural simplifications, financial policy texts require a more nuanced approach that considers both syntax and terminology. Additionally, the study demonstrates that institutional willingness to adopt readability tools significantly impacts communication effectiveness. By developing this Shiny application, the Central Bank of Chile has taken a significant step toward improving policy communication, ensuring greater clarity and accessibility for diverse audiences.Date and time: Sat, Aug 9, 2025 - 10:30–12:00 Author(s): Valentina Cortes Ayala (Central Bank of Chile); Karlla Munoz (Central Bank of Chile) Keyword(s): shiny, communication, central bank, readability Video recording available after conference: ✅ |
Valentina Cortes Ayala (Central Bank of Chile) |
| 10:30–12:00 | Penn Garden | Using R to Track, Monitor and Detect Changes in Movement and Diving Patterns of Beaked Whales off Cape Hatteras, NCMore infoBeaked whales can regularly dive to depths over 2,000m and during these dives hold their breath for over an hour. Understanding this physiological feat, as well as how individuals might alter their behavior when confronted with anthropogenic noise in the form of naval sonar, is a daunting task that requires a diverse team of biologists, data scientists and statisticians. Here we report how we use R as part of a multiyear experiment off Cape Hatteras, NC, where we have monitored the behavior of 117 individual whales across 23 sonar exposures. Using biologging devices that are attached to individual whales, we record data on their acoustic behavior, diving kinematics and swimming behavior across multiple temporal and spatial scales. Using R, we focus our analysis on records detailing diving data every five minutes for two weeks and coarser movement data for approximately one month. Our workflow includes using structured EDA with bespoke R code to examine patterns before and after exposure; R packages (ctmcmove, walkMI) to fit continuous-time discrete space models to movement; and R packages (momentuHMM) to fit multi-state hidden Markov models to the dive data. We bring these together with 4D modeled data on sound propagation in the water column. This workflow allows us to parameterize dose-response models within a Bayesian model written in JAGS to quantify how exposure impacts behavior in this family of deep diving whales.Date and time: Sat, Aug 9, 2025 - 10:30–12:00 Author(s): Rob Schick (Southall Environmental Associates, Inc.) Keyword(s): animal movement, diving, dose-response, hierarchical bayes, workflows Video recording available after conference: ✅ |
Rob Schick (Southall Environmental Associates Inc.) |
| 10:30–12:00 | Penn Garden | useR to Analyze Emergency Medical and Trauma DataMore infoEmergency Medical Services (EMS) and trauma centers provide life-saving care in critical moments. To support data-driven quality improvement in these high-stakes environments, the nemsqar and traumar R packages were developed to automate performance metric calculations for EMS and trauma care. This talk introduces nemsqar and traumar, which help researchers, data analysts, and public health professionals efficiently process standardized data and generate actionable insights. The nemsqar package simplifies the implementation of National EMS Quality Alliance (NEMSQA) performance measures. It processes National Emergency Medical Services Information System (NEMSIS) data, automating complex quality metric calculations to reduce errors, save time, and support prehospital care decision-making. The traumar package focuses on in-hospital trauma care, offering functions for risk-adjusted mortality metrics and other trauma quality indicators. Designed for flexibility, it supports multiple data sources and advanced statistical modeling to improve patient outcome assessments. This presentation will showcase real-world applications of both packages, demonstrating how they streamline quality reporting and enhance research efficiency. Attendees will see key functionalities, practical use cases, and integration strategies. Finally, the talk will highlight opportunities for community involvement, including contributions to package development, validation efforts, and feature expansion to meet evolving needs.Date and time: Sat, Aug 9, 2025 - 10:30–12:00 Author(s): Nicolas Foss (Bureau of Emergency Medical and Trauma Services, Division of Public Health, Iowa Health and Human Services) Keyword(s): ems, trauma, mortality, quality improvement, healthcare Video recording available after conference: ✅ |
Nicolas Foss (Bureau of Emergency Medical and Trauma Services Division of Public Health Iowa Health and Human Services) |
| Clinical trials | |||
| 10:30–12:00 | Gross 270 | Identifying Adverse Event Under-Reporting in Clinical Trials: A Statistical ApproachMore infoAdverse event (AE) detection is a critical component of clinical trials, yet we know that AE underreporting is a concern with traditional reporting methods. This project reviews AE under-reporting best practices and introduces a new AI/ML framework for detecting unreported AEs using R. This effort is being implemented under the Phuse OpenRBQM project. OpenRBQM is a collaborative effort to create open-source R packages focused on risk-based quality management (RBQM). First, we introduce the {gsm} and {simaerep} packages, which facilitate site- and country-level assessments of AEs. The {gsm} or Good Statistical Monitoring package provides a standardized framework for calculating Key Risk Indicators (KRIs) across all aspects of RBQM, including AE monitoring. The {simaerep} package, developed by the IMPALA consortium, uses advanced statistical methodologies to simulate AE reporting in clinical trials to detect under-reporting sites. The IMPALA and OpenRBQM teams have collaborated to create the {gsm.simaerep} package for use in the {gsm} framework. Finally, we present a new approach that leverages AI/ML techniques to identify specific missed AEs by analyzing data from other clinical trial domains. Using R, we develop models that detect patterns and highlight anomalies indicative of unreported AEs. By applying these methods to real-world clinical trial datasets, we demonstrate how AI/ML can enhance RBQM efforts. This presentation introduces tools that combine standard RBQM methodologies for evaluating adverse event under-reporting with AI methods for identifying specific missed AEs. Attendees will gain insights into implementing R-based techniques to uncover hidden safety signals in clinical research data.Date and time: Sat, Aug 9, 2025 - 10:30–12:00 Author(s): Laura Maxwell (Atorus Research), Jeremy Wildfire (Gilead Sciences) Keyword(s): clinical trials, pattern recognition, simulation, ai/ml, biostatistics Video recording available after conference: ✅ |
Laura Maxwell (Atorus Research) Jeremy Wildfire (Gilead Sciences) |
| 10:30–12:00 | Gross 270 | Implementing function factories for flexible clinical trial simulationsMore infoThe R package {simtrial} simulates clinical trial results using fixed or group sequential designs. One of its advantages is that it provides the user with sufficient flexibility to define complex stopping rules for specifying when intermediate analyses are to be performed and which tests are to be applied at each of these analyses. However, this flexibility in the design generates complexity when automating the simulations. In order to provide the desired flexibility while implementing a maintainable simulation framework, I applied a function factory strategy. Function factories are functions that return another function. This enables the user to define any arbitrary set of argument values, but then delay the execution of the function until the simulation is performed. In this presentation, I will provide an overview of function factories and explain how I implemented them in {simtrial}.Date and time: Sat, Aug 9, 2025 - 10:30–12:00 Author(s): John Blischak Keyword(s): functional programming, simulations, function factories, clinical trials, group sequential design Video recording available after conference: ✅ |
John Blischak |
| 10:30–12:00 | Gross 270 | Reproducible integrated processing of a large investigator-initiated, randomized-controlled multicenter clinical trial using Quarto and RMore infoNon-pharmaceutical clinical research often lacks reproducibility in data processing and analysis. In investigator-initiated trials, where financial resources are scarce, medical researchers must handle data management and analysis themselves, often using suboptimal tools. We present here the use case of a large, multicenter randomized-controlled trial in anesthesiology with over 2,500 enrolled patients. Embedded in a single Quarto-based project using tidyverse-style R, we processed the complete dataset from the electronic case report form from data tidying and analysis through plotting, report drafting, and presentation preparation. Our workflow is fully transparent, reproducible, and adaptive, following approaches demonstrated by Mine Çetinkaya-Rundel at R/medicine and Joshua Cook at posit:conf in 2024. To our knowledge, this represents the largest clinical trial managed using this methodology. This work demonstrates that accessible tools for tidy and reproducible scientific data processing are available even to researchers who are not native data scientists.Date and time: Sat, Aug 9, 2025 - 10:30–12:00 Author(s): Benedikt Schmid (University Hospital Würzburg, Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, Würzburg, Germany); Robert Werdehausen (Department of Anesthesiology and Intensive Care Medicine, University Hospital Leipzig, Germany), Christopher Neuhaus (Department of Anesthesiology, University Hospital Heidelberg, Heidelberg, Germany), Linda Grüßer (Department of Anaesthesiology, RWTH Aachen University Hospital, Germany), Peter Paal (Department of Anaesthesiology and Intensive Care Medicine, Hospitallers Brothers Hospital, Paracelsus Medical University, Salzburg, Austria), Patrick Meybohm (University Hospital Würzburg, Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, Würzburg, Germany), Peter Kranke (University Hospital Würzburg, Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, Würzburg, Germany), Gregor Massoth (Department of Anaesthesiology and Intensive Care Medicine, University Hospital Bonn, Germany) Keyword(s): medical research, reproducible workflow, randomized controlled clinical trial Video recording available after conference: ✅ |
Benedikt Schmid (University Hospital Würzburg Department of Anaesthesiology Intensive Care Emergency and Pain Medicine Würzburg Germany) |
| 10:30–12:00 | Gross 270 | Retrospective clinical data harmonisation reporting using R and QuartoMore infoThere has been an increase in projects that involve pooling data from multiple sources. This is because combining data is an economical way to increase the statistical power of an analysis of a rare outcome that could not be addressed using data from a single project. Prior to statistical or machine learning analysis, a data steward must be able to sort through these heterogeneous inputs and document the process in a coherent way for different stakeholders. Despite its importance in the big data environment, there are limited resources on how to document this process in a structured, efficient and robust way. This presentation will provide an overview of how I create clinical data harmonisation reports using R packages and a Quarto book project. A small preview can be found at https://github.com/JauntyJJS/harmonisation Attendees will learn the basic framework for creating a Quarto book or website to document data harmonisation processes; the basic workflow of the data harmonisation process; how to perform data validation when writing harmonisation code so that the workflow is robust to changes in the input data; ways to show higher management (with limited programming experience) that the code works (it is not enough to say that unit tests were used); and how to write an R script that creates many data harmonisation reports (one technical report for each pooled cohort and one report summarising the harmonisation process across all cohorts).Date and time: Sat, Aug 9, 2025 - 10:30–12:00 Author(s): Jeremy Selva (National Heart Centre Singapore) Keyword(s): data harmonisation, data validation, report making automation, quarto Video recording available after conference: ✅ |
Jeremy Selva (National Heart Centre Singapore) |
| Package lifecycle | |||
| 13:00–14:10 | Penn 1 | ARcenso: A Package Born from Chaos, Powered by CommunityMore infoHistorical census data in Argentina is scattered across multiple formats: books, spreadsheets, PDFs, and REDATAM, without a standardized structure. This lack of organization complicates analysis, requiring manual cleansing and integration of records before working with the data. As R users, we recognized an opportunity to transform this chaos into a meaningful solution not only for personal use but for all R users. That is how {arcenso} was born, a way to provide structured, ready-to-use census data, eliminating repetitive pre-processing and allowing users to focus on analysis with harmonized datasets. The goal is to make national census data in Argentina more accessible. Through the rOpenSci Champions program, the original idea turned into a functional R package. Thanks to the support of the R community, we learned how to structure the package, document datasets, and ensure reproducibility. This journey demonstrated the value of community learning, and those principles are embedded in {arcenso}, making it accessible and user-friendly. {arcenso} is currently under active development and has released its first dataset along with three core functions. However, this is just the beginning. There are more datasets to integrate, additional features to develop, and improvements to be made to enhance the user experience. In this talk, we will introduce the package for users from both public and private sectors, including academics and researchers facing data challenges. We will explain the framework used for turning problems into solutions, highlight tools and community resources, and try to inspire others to tackle their own data challenges.Date and time: Sat, Aug 9, 2025 - 13:00–14:10 Author(s): EMANUEL CIARDULLO; ANDREA GOMEZ VARGAS Keyword(s): package, community, workflow, census, official statistics Video recording available after conference: ✅ |
EMANUEL CIARDULLO ANDREA GOMEZ VARGAS |
| 13:00–14:10 | Penn 1 | Curating a Community of Packages: Lessons from a Decade of rOpenSci Peer ReviewMore infoThemed collections of packages have long been a common feature of the R ecosystem, from the CRAN Task Views to today's "universes". These range from tightly integrated toolboxes engineered by a single team, to journal-like repositories of packages passing common standards, or loose collections of packages organized around communities, themes, or development approaches. This talk will share insights for managing package collections, and their communities of developers, gleaned from a decade of rOpenSci's software peer-review initiatives. I will cover best practices for governing and managing collections, determining scope and standards for packages, onboarding and offboarding, and supporting continuing maintenance. Finally, I will discuss the essential role of mentorship and inclusive practices that support a diverse community of package maintainers and contributors.Date and time: Sat, Aug 9, 2025 - 13:00–14:10 Author(s): Noam Ross (rOpenSci) Keyword(s): standards, interoperability, maintenance, mentorship, community Video recording available after conference: ✅ |
Noam Ross (rOpenSci) |
| 13:00–14:10 | Penn 1 | rtables: Challenges, Advances and Lessons Learned Going Into Production At J&JMore infortables is an open-source framework for the creation of complex, multi-faceted tables developed by the author while at Roche. Here, we will discuss the process of adopting rtables at J&J as the lynchpin of a larger transition to R and open-source tools for the creation of production outputs in clinical trials. In particular we will touch on 3 aspects: development of novel features in rtables required to meet J&J's specific needs, development of additional tooling around rtables for use by the company's SPAs, and lessons learned during the process.Date and time: Sat, Aug 9, 2025 - 13:00–14:10 Author(s): Gabe Becker (Independent) Keyword(s): clinical trials, tables, tlg, visualization Video recording available after conference: ✅ |
Gabe Becker (Independent) |
| Teaching 1 | |||
| 13:00–14:10 | Penn 2 | Coursework RStudio Infrastructure at scale: Duke and NCShareMore infoTwo case studies covering lessons learned running large-scale RStudio infrastructure for coursework at Duke University and [NCShare][1] (an NSF-funded consortium to advance scientific computing and innovate STEM education at North Carolina’s historically marginalized institutions). Each semester Duke provides containerized RStudio instances for over 1200 students. Similar infrastructure is used in the NCShare consortium to provide advanced computing environments to less-resourced higher-education institutions. This talk covers best practices and pitfalls for automation, packaging, management, and support of RStudio and how cross-institutional collaboration can make these environments more widely available. [1]: https://ncshare.org/Date and time: Sat, Aug 9, 2025 - 13:00–14:10 Author(s): Mark McCahill (Duke University) Keyword(s): educational consortia, coursework infrastructure, automation Video recording available after conference: ✅ |
Mark McCahill (Duke University) |
| 13:00–14:10 | Penn 2 | Enhancing R Instruction: Adapting Workshops for Time-Constrained LearnersMore infoThe Data & Visualization Services Department at North Carolina State University Libraries offers data science support to faculty, staff, students, and the broader community. This support includes data science consulting, workshops and instruction on data science and programming topics, as well as specialized computer lab spaces equipped with hardware and software for data and visualization work. Among these, our introductory workshops are particularly popular. Our Intro to R workshop series consists of three sessions covering basic programming, data cleaning, and data visualization. Participants come from diverse academic and professional backgrounds, with varying levels of coding experience—from no prior exposure to limited familiarity with R or other programming languages. Additionally, they must balance academic, professional, and personal commitments, making it essential to provide efficient yet comprehensive instruction. We recently refined our curriculum to address these challenges in response to direct and observed student feedback. This presentation will explore the specific curriculum changes, the challenges they aim to resolve, and the role of instructor-led workshops in supporting early-stage R learners.Date and time: Sat, Aug 9, 2025 - 13:00–14:10 Author(s): Selene Schmittling (North Carolina State University); Shannon Ricci (North Carolina State University), Alp Tezbasaran Keyword(s): instruction, curriculum development, learner diversity, workshops Video recording available after conference: ✅ |
Selene Schmittling (North Carolina State University) |
| 13:00–14:10 | Penn 2 | Rhapsody in R: Exploring Probability Through MusicMore infoProbability is often introduced with applications from the natural and social sciences, but its role in the arts is less frequently explored. One example is stochastic music, pioneered by avant-garde 20th-century composers like [Iannis Xenakis][1], who used probabilistic models and computer simulations to generate musical structures. While the aesthetic appeal of such music is subjective, its mathematical foundations offer a compelling way to engage students with probability and randomness. This talk presents an assignment for an introductory probability course where students compose their own stochastic music using R. By applying their knowledge of probability distributions and computer simulation, they explore randomization in pitch, rhythm, meter, instrumentation, and harmony — observing emergent patterns along the way. The R package [gm][2] by Renfei Mao provides a user-friendly framework for layering musical elements, while integration with MuseScore allows students to generate sheet music and MIDI playback. This activity not only reinforces key concepts, but also offers students a fun and creative way to apply probability, engaging a different part of their brain than traditional scientific applications. [1]: https://youtu.be/nvH2KYYJg-o?feature=shared [2]: https://cran.r-project.org/web/packages/gm/index.htmlDate and time: Sat, Aug 9, 2025 - 13:00–14:10 Author(s): John Zito (Duke University) Keyword(s): teaching, probability, music Video recording available after conference: ✅ |
John Zito (Duke University) |
| Web APIs | |||
| 13:00–14:10 | Penn Garden | Automating CDISC Metadata Retrieval: An R-Based Approach Using the CDISC Library APIMore infoThe CDISC Library API provides a programmatic gateway to clinical data standards, including SDTM and ADaM domains, variables, and controlled terminology. This presentation showcases an R-based approach to integrating the API for automated retrieval and structuring of CDISC metadata and controlled terminology, eliminating the need for manual extraction from PDFs or Excel files. Leveraging R packages such as shiny, httr2, jsonlite, and tidyverse, we demonstrate a reproducible workflow that queries the /mdr/sdtmig/{version} and /mdr/ct/{version} endpoints, parses JSON responses into structured data frames, and presents the results in a web application. Key topics include authentication via API keys, handling nested JSON structures, and ensuring seamless interaction with CDISC’s evolving standards. This approach enhances efficiency, reduces manual effort, and improves traceability in clinical data workflows.Date and time: Sat, Aug 9, 2025 - 13:00–14:10 Author(s): Jagadish Katam Keyword(s): cdisc, sdtm, adam, controlled terminology, shiny, api Video recording available after conference: ✅ |
Jagadish Katam |
| 13:00–14:10 | Penn Garden | Web APIs for useRs: Getting data from websites, databases, and LLMsMore infoMany websites and services provide APIs, and useRs can take advantage of them to get data, perform database operations, and talk to Large Language Models (LLMs). The httr2 package, with its support for sequential and parallel requests, is a great tool for efficient API interactions. I will demonstrate its use through two real-world examples. First, I will introduce the frstore package, which I developed to interact with Google Firestore, a NoSQL database. While client libraries exist for Python and JavaScript, R users were left out—until now. frstore enables create, read, update, and delete (CRUD) operations using httr2, making it a powerful tool for R users working with Firestore. The second example is a Shiny app designed to create an immersive storytelling experience. Users provide the first sentence of a children’s story, and the app uses httr2 to interact with multiple APIs. Cloudflare’s Workers Model API is used to send requests to text generation and image generation models. Moreover, Eleven Labs’ API converts text to speech for audiobook-like narration. These results are integrated into a quarto revealjs slide deck that yields a delightful, interactive storytime experience. This talk is aimed at R users of all levels who want to expand their toolkit for web data access and API interactions. Whether you’re scraping data, working with APIs, or building interactive applications, this session will provide practical examples to enhance your R workflows.Date and time: Sat, Aug 9, 2025 - 13:00–14:10 Author(s): Umair Durrani (Presage Group) Keyword(s): api, httr2, llm, database, shiny Video recording available after conference: ✅ |
Umair Durrani (Presage Group) |
| 13:00–14:10 | Penn Garden | {plumber2}: Streamlining Web API Development in RMore infoOver the past nine years, the R package {plumber} has simplified the creation of web APIs using annotations over existing R source code with roxygen2-like comments. During this time, the community has gathered valuable insights and identified numerous areas for improvement. To invest in a way forward, a new package called {plumber2} has been created. {plumber2} is designed from the ground up to be highly extensible, enabling developers to easily integrate custom decorators to modify the behavior of their APIs. Furthermore, {plumber2} is built using a modern foundation, leveraging the latest packages associated with the {fiery} framework. This modern architecture is built upon middleware (the ability to introduce custom logic at specific points within the API's request handling process), one of the many fine-grained controls over how your API can behave. By incorporating these improvements and embracing a modern framework, {plumber2} offers a sustainable path forward for building web APIs in R. This new approach avoids the need for short-term fixes and ensures that {plumber2} can continue to evolve and adapt to the changing needs of developers.Date and time: Sat, Aug 9, 2025 - 13:00–14:10 Author(s): Barret Schloerke (Posit PBC); Thomas Pedersen (Posit PBC) Keyword(s): api, plumber2, plumber, package, web api Video recording available after conference: ✅ |
Barret Schloerke (Posit PBC) |
| High-dimensional data | |||
| 13:00–14:10 | Gross 270 | Generating interesting high-dimensional data structuresMore infoA high-dimensional dataset is one in which each observation is described by many features, or dimensions. Such a dataset might contain various types of structures that have complex geometric properties, such as nonlinear manifolds, clusters, or sparse distributions. We can generate data containing a variety of structures using mathematical functions and statistical distributions. Sampling from a multivariate normal distribution will generate data in an elliptical shape. Using a trigonometric function we can generate a spiral. A torus function can create a donut shape. High-dimensional data structures are useful for testing, validating, and improving algorithms used in dimensionality reduction, clustering, machine learning, and visualization. Their controlled complexity allows researchers to understand challenges posed in data analysis and helps to develop robust analytical methods across diverse scientific fields like bioinformatics, machine learning, and forensic science. Functions to generate a large variety of structures in high dimensions are organized into the R package cardinalR, along with some already generated examples.Date and time: Sat, Aug 9, 2025 - 13:00–14:10 Author(s): Piyadi Gamage Jayani Lakshika (Monash University, Australia); Dianne Cook (Monash University, Australia), Paul Harrison (Monash University, Australia), Michael Lydeamore (Monash University, Australia), Thiyanga S. Talagala (University of Sri Jayewardenepura, Sri Lanka) Keyword(s): high-dimensional data structures, mathematical functions, statistical distributions, geometrics Video recording available after conference: ✅ |
Piyadi Gamage Jayani Lakshika (Monash University Australia) |
| 13:00–14:10 | Gross 270 | Introducing riemmtanMore infoThe statistical analysis of random variables that take values in Riemannian manifolds is a rapidly growing area of research. Its main application is the study of connectomes obtained from brain imaging, which belong to the manifold of symmetric positive definite matrices. Large amounts of work have been devoted to addressing a variety of issues including the development of new metrics, new statistical models and visualization techniques. Unfortunately, the tools offered by R to handle this type of data have not evolved with the speed necessary to match the momentum of this growing area of the statistical literature. The R packages Riemann and frechet are important steps in that direction, but new tools are necessary to incorporate recent developments. That is why we are introducing riemmtan, a new R package. Its main goal is to offer a high-level interface that abstracts away many day-to-day operations of this kind of analysis. In addition, it allows the user to exploit the growing capabilities of modern computer clusters by making use of parallelism in several parts of its implementation, including the computation of Fréchet means. Finally, it makes use of the object-oriented programming tools in R to make Riemannian metrics self-contained modules, allowing users to easily implement and experiment with new metrics. We hope riemmtan will become the foundation for an ecosystem of tools that allow for efficient and user-friendly analysis of Riemannian manifold valued data.Date and time: Sat, Aug 9, 2025 - 13:00–14:10 Author(s): Nicolas Escobar (Indiana University); Jaroslaw Harezlak (Indiana University) Keyword(s): riemannian manifolds, connectomics, fmri imaging Video recording available after conference: ✅ |
Nicolas Escobar (Indiana University) |
| 13:00–14:10 | Gross 270 | Multi-omics Integration with GAUDI: A Novel R Package for Non-linear Dimensionality Reduction and Interpretable Clustering AnalysisMore infoIntegrating high-dimensional multi-omics data presents significant challenges in computational biology, particularly when handling complex non-linear relationships across diverse biological layers. We present GAUDI (Group Aggregation via UMAP Data Integration), a novel R package that leverages Uniform Manifold Approximation and Projection (UMAP) for the concurrent analysis of multiple omics data types. GAUDI addresses key limitations of existing methods by enabling non-linear integration while maintaining interpretability and mitigating bias from datasets with vastly different dimensionalities. The GAUDI R package implements a straightforward yet powerful workflow: (1) independent UMAP embeddings are applied to each omics dataset, creating standardized representations that preserve dataset-specific structures; (2) these embeddings are concatenated; (3) a second UMAP transformation integrates these embeddings into a unified space; (4) hierarchical density-based clustering identifies sample groups; and (5) feature importance analysis via XGBoost and SHAP values enables biological interpretation. Our benchmarking against six state-of-the-art multi-omics integration methods demonstrates GAUDI's superior performance across diverse datasets. Using simulated multi-omics data with known ground truth, GAUDI achieved perfect clustering accuracy across all tested scenarios. In cancer datasets from TCGA, GAUDI identified clinically relevant patient subgroups with significant survival differences, particularly in acute myeloid leukemia where it detected high-risk subgroups missed by other methods. At the single-cell level, GAUDI not only correctly classified cell lines but uniquely identified biologically meaningful substructures within them, confirmed by differential expression and pathway enrichment analyses. When evaluating large-scale functional genomics datasets from the Cancer Dependency Map (DepMap) Project, GAUDI demonstrated superior lineage identification accuracy. In a benchmark integrating gene expression, DNA methylation, miRNA expression, and metabolomics across 258 cancer cell lines, GAUDI achieved the highest score for lineage discrimination, approximately 15% better than the next-best performing method, MOFA+, underscoring its effectiveness with complex, heterogeneous multi-omics data. The GAUDI R package provides a user-friendly interface with extensive documentation, visualization tools, and compatibility with standard bioinformatics workflows. By combining the strengths of non-linear dimensionality reduction with interpretable machine learning approaches, the GAUDI R package offers researchers a powerful new tool for exploring complex relationships across multiple biological data types, potentially revealing novel insights in systems biology, precision medicine, and biomarker discovery. Package: https://github.com/hirscheylab/gaudi Benchmark: https://github.com/hirscheylab/umap_multiomics_integrationDate and time: Sat, Aug 9, 2025 - 13:00–14:10 Author(s): Pol Castellano Escuder (Heureka Labs) Keyword(s): multi-omics integration, dimension reduction, clustering, statistical learning, interpretable machine learning, benchmarking Video recording available after conference: ✅ |
Pol Castellano Escuder (Heureka Labs) |
| Pragmatic programmer | |||
| 14:30–15:40 | Penn 1 | "How did you even think of that???" Techniques to code much fasterMore infoThis talk will present a totally different way of thinking about writing R code. This method is completely different from anything I have ever seen in the R community (or any data science community). This is the method I used to write four R packages - NumericEnsembles, ClassificationEnsembles, LogisticEnsembles and ForecastingEnsembles. The largest part of the code was written in 15 months, and was approximately 15,000 lines at that time. No AI was used in any of the code development. This is a totally different style of thinking, using the same set of R tools that everyone else can use. What is totally different is the thinking that goes into the code development, compared to what I've seen everywhere else. This talk will show how the same method may be applied to the work you are doing. Come prepared to see that the methods you've been using to think through solutions and write code that achieves reproducible results can be improved very significantly by improving your thinking, not necessarily your tools. There will be several practical examples and live demonstrations to show how you may use these methods in real coding situations. Improving your thinking can do much more for improving how you code than adding the latest tools. This presentation will demonstrate how that was done in the development of four packages that automatically build ensembles as part of the analysis process, and how you can use the same methods in your work.Date and time: Sat, Aug 9, 2025 - 14:30–15:40 Author(s): Russ Conte (Owner@dataaip.com) Keyword(s): "code better, efficient coding, fast coding" Video recording available after conference: ✅ |
Russ Conte (Owner@dataaip.com) |
| 14:30–15:40 | Penn 1 | Reusing 'ggplot2' code: how to design better plot helper functionsMore infoWrapping 'ggplot2' code into plot helper functions is a common way to make multiple versions of a custom plot without copying and pasting the same code over and over again. Helper functions can replace long and complex 'ggplot2' code chunks with just a single function call. However, if that single function is not designed carefully, the initial convenience can often turn into frustration. While helper functions can reduce the amount of code needed to remake a complicated plot, they often mask the underlying layered grammar of graphics, complicating further customisation and tweaking of the plot. This talk addresses how to design effective 'ggplot2' plot helper functions that maximise reuse convenience whilst preserving access to the elegant flexibility of layered plot composition. By studying existing 'ggplot2' extensions for producing calendar plots, we identify a number of common pitfalls, including overly specific function arguments and hidden data manipulations. Then, we discuss how to avoid these pitfalls and retain the benefits of 'ggplot2' by: separating data preparation from plotting, utilising list arguments for customisation, and providing transparent documentation. We illustrate these strategies using examples from the design of the 'ggtilecal' package, which provides helper functions for plotting calendars using the geom_tile() geometry from ggplot2.Date and time: Sat, Aug 9, 2025 - 14:30–15:40 Author(s): Cynthia Huang (Monash University) Keyword(s): r package and function design, layered grammar of graphics, data visualisation, ggplot2 extensions Video recording available after conference: ✅ |
Cynthia Huang (Monash University) |
| 14:30–15:40 | Penn 1 | The Language of Data: How R Package Syntax Shapes Analysis and ThoughtMore infoFor most users in data science, analytics, and research, a package’s syntax or API is their primary interface with the software. While R provides a well-defined framework for creating packages that make programming accessible, syntax choices serve as key connection points between users and their data. R packages exhibit a range of syntax styles—from explicit to implicit, verbose to symbolic, and structured to flexible. Drawing on research on language, cognition, and user experience, this talk explores how syntax design in R packages shapes the way we interact with data, approach analysis, and solve complex problems. In this talk, I will examine syntax design in powerful and popular data wrangling software in R--data.table, dplyr, polars, and base R, comparing their approaches and discussing their impact on usability, interpretation, and problem-solving in data workflows. Attendees will leave with an understanding of syntax design, how current leaders in data wrangling design their syntax, and considerations for how these designs can impact user behavior.Date and time: Sat, Aug 9, 2025 - 14:30–15:40 Author(s): Tyson Barrett (Highmark Health) Keyword(s): data wrangling, programming, syntax, analytics Video recording available after conference: ✅ |
Tyson Barrett (Highmark Health) |
| Teaching 2 | |||
| 14:30–15:40 | Penn 2 | Expanding Data Science's Reach through Interdisciplinarity and the HumanitiesMore infoWhat does data science mean for those disciplines that don’t traditionally align themselves with this work? More specifically, how might instructors in the Humanities define — and teach — data science? How can the Humanities use data science to resist academic siloing and promote alignment across disciplines and methodologies? What is to be gained for traditional data science programs with a transdisciplinary understanding and application of data science? This presentation explores three courses developed by English instructors at North Carolina State University's Data Science and AI Academy: Data Visualization, Introduction to AI Ethics, and Storytelling with Data and AI. The presenters will explain how their Humanities backgrounds help them create courses that extend data science beyond traditional applications. They'll share examples of assignments that incorporate their disciplinary expertise while integrating core data science principles from the ADAPT model. Furthermore, by offering alternative perspectives on data science, we create "gateway" courses that attract students who might not otherwise enter the field. The presenters will also discuss how these courses achieve interdisciplinarity both through content and the student participants. The presenters will demonstrate how the three representative courses complement traditional data science curriculum (coding) by broadening the field's reach in two ways: 1. enhancing the overall educational experience for students and 2. creating access points for faculty who don't typically identify with data science, thus attracting instructors without traditional data science backgrounds. The presentation will conclude with reflections on lessons learned, challenges encountered, and strategies for institutions seeking to implement similar cross-disciplinary approaches. The presenters will share preliminary assessment data demonstrating student outcomes and discuss implications for the future of data science education across diverse academic contexts.Date and time: Sat, Aug 9, 2025 - 14:30–15:40 Author(s): Kelsey Dufresne (North Carolina State University), James Harr (Christian Brothers University), Christin Phelps (North Carolina State University) Keyword(s): interdisciplinary, data science, outreach, data visualization, ai Video recording available after conference: ✅ |
Kelsey Dufresne (North Carolina State University) James Harr (Christian Brothers University) Christin Phelps (North Carolina State University) |
| 14:30–15:40 | Penn 2 | Leveraging LLMs for student feedback in introductory data science coursesMore infoA considerable recent challenge for learners and teachers of data science courses is the proliferation of the use of LLM-based tools in generating answers. In this talk, I will introduce an R package that leverages LLMs to produce immediate feedback on student work to motivate them to give it a try themselves first. I will discuss technical details of augmenting models with course materials, backend and user interface decisions, challenges around evaluations that are not done correctly by the LLM, and student feedback from the first set of users. Finally, I will touch on incorporating this tool into low-stakes assessment and ethical considerations for the formal assessment structure of the course relying on LLMs.Date and time: Sat, Aug 9, 2025 - 14:30–15:40 Author(s): Mine Cetinkaya-Rundel (Duke University + Posit PBC) Keyword(s): r-package, teaching, education, feedback, ai, llm Video recording available after conference: ✅ |
Mine Cetinkaya-Rundel (Duke University + Posit PBC) |
| 14:30–15:40 | Penn 2 | Teaching Statistical Computing with R and PythonMore infoComputing courses can be daunting for students for a variety of reasons, including programming anxiety, difficulty learning a programming language in a second language, and unfamiliarity with assumed computer knowledge. In an ongoing attempt to teach statistical computing effectively, I developed a textbook intended for use in a flipped classroom setting where R and Python are taught concurrently. This approach allows students to learn programming concepts applicable to most languages, while developing skills in both R and Python that can be used in an increasingly multilingual field. In this talk, I discuss the book's design and how it integrates into a sequence of undergraduate and graduate computing courses. Along the way, we will talk about opinionated coding decisions, use of memes, comics, and YouTube tutorials, and other features integrated into this open-source textbook built with quarto and hosted on GitHub.Date and time: Sat, Aug 9, 2025 - 14:30–15:40 Author(s): Susan Vanderplas (University of Nebraska - Lincoln) Keyword(s): data science, education, statistical computing, python, reproducibility Video recording available after conference: ✅ |
Susan Vanderplas (University of Nebraska - Lincoln) |
| Workflows | |||
| 14:30–15:40 | Penn Garden | Building Agentic Workflows in R with axolotrMore infoLarge Language Models (LLMs) have revolutionized how we approach computational tasks, yet R users often face significant barriers when integrating these powerful tools into their workflows. Managing multiple API providers, handling authentication, and orchestrating complex interactions typically requires substantial boilerplate code and specialized knowledge across different service ecosystems. This presentation introduces axolotr, an R package that provides a unified interface for interacting with leading LLM APIs including OpenAI's GPT, Google's Gemini, Anthropic's Claude, and Groq. Through progressive examples of increasing complexity, we demonstrate how R users can seamlessly incorporate LLMs into their data science workflows - from simple one-off queries to sophisticated agentic systems. We begin with fundamental LLM interactions, showing how axolotr simplifies credential management and API calls across providers. Next, we explore function-based implementations that transform raw LLM capabilities into reusable analytical tools. Finally, we demonstrate how to build true agentic workflows where multiple LLM calls work together to maintain state, make decisions, and accomplish complex tasks autonomously. Attendees will learn: - How to quickly incorporate LLMs into existing R projects using a consistent interface - Techniques for creating functions that leverage LLM capabilities for data analysis and interpretation - Approaches for building agentic systems that can reason about data, maintain context, and operate iteratively - Practical strategies for managing costs, optimizing performance, and selecting appropriate models for different tasks This presentation provides both newcomers and experienced R users with the practical knowledge needed to harness the power of LLMs through a streamlined, R-native approach. By the end, attendees will have a roadmap for transforming their interaction with LLMs from simple API calls to sophisticated autonomous workflows that can dramatically enhance productivity and analytical capabilities.Date and time: Sat, Aug 9, 2025 - 14:30–15:40 Author(s): Matthew Hirschey Keyword(s): llms, ai, agents, natural language processing, workflow automation Video recording available after conference: ✅ |
Matthew Hirschey |
| 14:30–15:40 | Penn Garden | Data as code, packaging data as code with duckdb and S3More infoDuckDB and object storage (S3) offer a powerful and cost-effective way to store and access data. Packaging the data as an R package provides an efficient way to document data processing, simplify user access, incorporate business logic, increase reproducibility, and leverage both code and data. This talk will use [cori.data.fcc][1], featuring the US FCC National Broadband Data, as a case study for a data package. We will discuss the advantages discovered during its development, challenges we encountered, and tips for others who wish to adapt these methods for their own needs. [1]: https://ruralinnovation.github.io/cori.data.fcc/Date and time: Sat, Aug 9, 2025 - 14:30–15:40 Author(s): Olivier Leroy; John Hall (Center on Rural Innovation) Keyword(s): duckdb, s3, data package, broadband Video recording available after conference: ✅ |
Olivier Leroy |
| 14:30–15:40 | Penn Garden | Machine Learning-Powered Metabolite Identification in R: An Automated Workflow for Identifying Metabolomics Dark MatterMore infoUpwards of 90% of small molecules detected in LC-MS/MS-based untargeted metabolomics are unidentified due to limitations in current analytical techniques. Although this “dark matter” can significantly contribute to disease diagnosis and biomarker discovery, current identification methods are costly and resource-intensive. This study addresses these challenges by developing a computational workflow in R to encode the tandem mass spectra into simplified structural fingerprints, which can be predicted and related to known fingerprints in molecular databases. The developed pipeline includes different R packages such as RSQLite, SF, rcdk, chemminer, caret, sparsepca, rinchi, and rpubchem, which together improve metabolite identification in untargeted metabolomics. A total of 2,973 mass spectra of known and unknown molecules from an in-house high resolution LC-MS/MS study were extracted from an SQL database (mzVault) using the RSQLite package. The collected spectra were converted into machine-readable numbers using the rawToHex and readBin functions from the SF package. SMILES representations of known molecules were obtained by querying their names against PubChem using the rpubchem package. The set of 166 Molecular ACCess System (MACCS) fingerprints were computed for known molecules based on their SMILES using rCDK and ChemmineR packages. In the next step, 166 random forest (RF) models were trained on MS2 spectra of known molecules to model the MACCS fingerprints using the caret package. Before training, spectral data were normalized and subjected to dimensionality reduction using robust sparse principal component analysis (rSPCA) via the sparsepca package. The trained RF models were applied to high-resolution MS2 spectra of unknown molecules to predict their MACCS fingerprints, which were then used for similarity searches in the Human Metabolome Database (HMDB) using the Tanimoto coefficient. Retrieved candidates from HMDB were further refined based on LogP, topological polar surface area (TPSA), molecular mass, and retention time. The workflow was tested on an LC-MS/MS dataset containing 1,071 known and 1,902 unknown compounds. Despite the high dimensionality, rSPCA reduced the data to 25 principal components, preserving 97% of variance. RF models achieved a mean accuracy of 0.87 in 3-fold cross-validation. On average, 4.1±11.31 unique HMDB molecules were listed for each unknown molecule, and the retrieved list was prioritized using a hybrid scoring function. Applying a Tanimoto similarity threshold (>0.7), this workflow identified at least one HMDB match for 1,079 unknowns, improving metabolite identification by 57%. The incorporation of a hybrid scoring system based on Tanimoto similarity and physicochemical properties enhanced candidate ranking and structural elucidation of unknown metabolites.Date and time: Sat, Aug 9, 2025 - 14:30–15:40 Author(s): Ahmad Manivarnosfaderani (University of Arkansas for Medical Sciences); Sree V. Chintapalli (University of Arkansas for Medical Sciences), Renny Lan (University of Arkansas for Medical Sciences), Hailemariam Abrha Assress (University of Arkansas for Medical Sciences), Brian D. Piccolo (University of Arkansas for Medical Sciences), Colin Kay (University of Arkansas for Medical Sciences) Keyword(s): metabolomics, cheminformatics, machine learning, identification Video recording available after conference: ✅ |
Ahmad Manivarnosfaderani (University of Arkansas for Medical Sciences) |
| Life sciences | |||
| 14:30–15:40 | Gross 270 | Co-occurrence Analysis And Knowledge Graphs For Biomedical ResearchMore infoThe analysis of data from large hospitals and healthcare providers comes with unique challenges. Electronic health records document information from patients’ visits such as the diagnoses performed, medications prescribed, and more. To discover the best treatment options, facilitate early diagnosis, and understand co-morbidities and adverse effects, biomedical researchers extensively use co-occurrence analysis, which measures how features such as diagnoses and medications are correlated with each other over time at the patient level. Results can then be merged between independent health systems while maintaining patient data privacy in a process called transfer learning, and insights can be organized, visualized and interpreted using knowledge graphs. Knowledge graphs model relationships between concepts, e.g. one medication “may treat” one disease. Biomedical research consistently shows that while large language models perform very well at discovering similar concepts, such as synonyms or closely related diagnoses, co-occurrence analysis and knowledge graphs perform better when trying to discover related concepts, such as best treatment options or adverse effects. A large part of contemporary biomedical research is thus dedicated to merging results from pre-trained large language models and study-specific co-occurrence analyses. To help researchers efficiently perform co-occurrence analyses and build knowledge graphs, we developed the nlpembeds and kgraph R packages. The nlpembeds package enables efficient computation of co-occurrence matrices between tens of thousands of concepts from millions of patients over many years – which can prove challenging when taking into account not only codified data such as diagnoses and medications but also natural language processing concepts extracted from clinicians’ notes (comments justifying why specific diagnoses were performed or medications prescribed). The kgraph package makes it possible to measure the performance of the results, build the corresponding knowledge graphs, and visualize them as interactive JavaScript networks. We used the packages to perform several studies, such as the analysis of insurance claims of 213 million patients (Inovalon), the visualization of Mendelian randomization meta-analyses performed by the Veterans Affairs, and the transfer learning between several institutions involved in the Center for Suicide Research and Prevention to build risk prediction models. In this talk, I will showcase the highlights of these packages, introduce their use, and demonstrate how to perform real-world interpretations useful for clinical research. Co-occurrence analysis and knowledge graphs make it possible to discover insights from large databases of electronic health records in order to improve our understanding of biomedical processes and the realities of large-scale and long-term patient care.Date and time: Sat, Aug 9, 2025 - 14:30–15:40 Author(s): Thomas Charlon (Harvard Medical School) Keyword(s): embeddings, knowledge graph, biomedical research, patient care, mental health Video recording available after conference: ✅ |
Thomas Charlon (Harvard Medical School) |
| 14:30–15:40 | Gross 270 | Counting Birds Two Ways: Joint models of species abundanceMore infoJoint species distribution models (JSDMs) enable ecologists to characterize relationships between species and their environment, infer interspecific dependencies, and predict the occurrence or abundance of entire ecological communities. Although several popular JSDM frameworks exist, the problem of modeling sparse relative abundance data remains an inferential and computational challenge for many. We describe two approaches and corresponding implementations within the context of a case study involving a large community of bird species surveyed across Finland. The first approach, hierarchical modeling of species communities, employs a generalized linear latent variable model and supports diverse data and sampling designs but falters when faced with sparse and overdispersed count data. The second approach, binary and real count decompositions, directly addresses limitations of log-linear multivariate count models but lacks some of the generality and extensibility.Date and time: Sat, Aug 9, 2025 - 14:30–15:40 Author(s): Braden Scherting (Duke University) Video recording available after conference: ✅ |
Braden Scherting (Duke University) |
| 14:30–15:40 | Gross 270 | Detecting Read Coverage Patterns Indicative of Genetic Variation and MobilityMore infoRead coverage data is commonly used in bioinformatics analyses of sequenced samples. Read coverage represents the count of short DNA sequences that align to specific locations in a reference sequence. When plotted, one can visualize how read coverage changes along the reference sequence. Some read coverage patterns, like gaps and elevations in coverage, are associated with real biological phenomena like mobile genetic elements (MGEs) and structural variants (SVs), for example. MGEs are genetic sequences capable of transferring to new genomic locations where they may disrupt functioning genes. Structural variants (SVs) refer to small genetic differences between individuals or microbial populations caused by deletions, insertions, and duplications of gene sequences. MGEs and SVs are important to host health and while many tools have been developed to detect them, the vast majority are either database-dependent or are limited to detection of specific types of MGEs and SVs. Using gaps and elevations in read coverage is a more general detection method for diverse MGEs and SVs; however, the manual inspection of coverage graphs is tedious, time-consuming, and subjective. We developed an algorithm that detects distinct patterns in read coverage data and implemented it in two R packages, TrIdent and ProActive, which automatically identify, classify, and characterize read coverage patterns indicative of genetic variation and mobilization. Our read coverage pattern-matching algorithm offers a unique approach to sequence data analysis, and our tools enable researchers to efficiently implement read coverage inspections into their standard bioinformatics pipelines.Date and time: Sat, Aug 9, 2025 - 14:30–15:40 Author(s): Jessie Maier (North Carolina State University); Craig Gin (North Carolina State University), Benjamin Callahan (North Carolina State University), Manuel Kleiner (North Carolina State University) Keyword(s): pattern-matching, bioinformatic tools, read coverage data, mobile genetic elements, structural variants Video recording available after conference: ✅ |
Jessie Maier (North Carolina State University) |
| Day 3: Sunday, August 10, 2025 | |||
| Time | Room | Title, abstract, and more info | Presenter(s) |
|---|---|---|---|
| Lightning | |||
| 10:30–12:00 | Penn 1 | An Interactive webR Approach to Teaching Statistical Inference to Behavioral Science StudentsMore infoIn many applied data analysis courses, null hypothesis significance testing (NHST) is introduced using a frequentist framework based on theoretical probability distributions. Yet, behavioral science students often struggle with NHST because its logic can seem counterintuitive. A common error is viewing the p-value as the probability that the null hypothesis is true, instead of recognizing it as the chance of obtaining data as extreme (or more extreme) than observed, assuming the null hypothesis is true. In contrast, a permutation test lets students derive p-values directly from data, making core ideas such as randomness, variability, and extreme outcomes more tangible. In my applied data science course for students of Psychology, I use webR to create interactive R-based activities that immerse students in statistical concepts, uncertainty visualization, and hands-on experimentation. This presentation showcases a webR-based tutorial where students explore permutation tests to examine differences between experimental conditions in a recent Psychological study published on the Center for Open Science repository. Through interactive resampling and dynamic visualizations of empirical null distributions, students gain insight into how random variation influences statistical results. They then use the generated empirical null distribution to assess the extremity of their observed test statistic, calculate p-values, and construct confidence intervals -- deepening their understanding of NHST. Running simulations within webR enables interactive, self-contained learning modules, allowing students to experiment with code in real time within structured educational materials that scaffold their learning. I will discuss how this approach boosts engagement, supports replicability, and lowers barriers to learning R-based analysis. I will share insights from student feedback, challenges encountered, and best practices for integrating webR into applied data science education.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Kimberly Henry (Colorado State University) Keyword(s): webr, statistical inference, applied data science, behavioral science Video recording available after conference: ✅ |
Kimberly Henry (Colorado State University) |
| 10:30–12:00 | Penn 1 | Bringing the fun of hex stickers to your R sessionMore infoOver the years, R users have embraced logos and stickers for packages and communities of practice as a fun way to show support for open-source projects. The logos themselves often reflect a project’s core attributes and not just vague visual branding. This talk describes the development and functionality of the hexsession package, which creates interactive hexagonal tiles for logos or custom images. Similar to arranging stickers on a laptop, we can tessellate the logos for our installed packages or for any arbitrary set of images and produce a responsive HTML tile with each image linking to its respective web page. This output, created using CSS and JavaScript behind the scenes, now integrates with web-based documentation platforms that use Quarto or RMarkdown for websites, vignettes, and online books. This can be useful for documenting metapackages, developer portfolios, or showcasing any interrelated sets of packages. Developing hexsession was a challenging but rewarding process, which was greatly facilitated by existing open-source resources and feedback from its small but helpful user base. What started as a silly but ambitious idea will hopefully mature into a way of visually showcasing the tools that power our projects and also bring us together as a community.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Luis D. Verde Arregoitia (Instituto de Ecología AC - INECOL) Keyword(s): hex, stickers, quarto, documentation Video recording available after conference: ✅ |
Luis D. Verde Arregoitia (Instituto de Ecología AC - INECOL) |
| 10:30–12:00 | Penn 1 | Celebrating R: Code Snippets from NC Public Health EpidemiologyMore infoUNC Injury Prevention Research Center collaborates with the NC Division of Public Health Injury & Violence Prevention Branch to track and prevent injuries. Injury epidemiology is broad; its scope includes: self-harm and other violence; motor vehicle, bicycle, and pedestrian crashes; overdoses, alcohol, and cannabis harms; and social drivers like child maltreatment and homelessness. Using short code examples that span these topic areas, this lightning talk will celebrate R by sharing favorite code patterns and snippets from a decade of public health epidemiology programming in research, public health practice, and graduate coursework settings here in North Carolina.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Michael Fliss (UNC Injury Prevention Research Center / NC Division of Public Health) Keyword(s): epidemiology, north carolina, public health, Video recording available after conference: ✅ |
Michael Fliss (UNC Injury Prevention Research Center / NC Division of Public Health) |
| 10:30–12:00 | Penn 1 | Facilitating Open-Source Transitions: Lessons Learned from a Hands-On R Training InitiativeMore infoThe increasing adoption of open-source tools like R is transforming clinical and real-world data analysis, improving visualization and reporting, reducing costs, and fostering industry-academia collaboration. However, transitioning from proprietary to open-source solutions presents cultural and operational challenges for organizations, underscoring the need for proper training to overcome reproducibility problems. To address this, the Duke Clinical Research Institute, an academic research organization, developed a six-module introductory R training program tailored to varying experience levels. The curriculum begins by exploring the distinction between R and RStudio, including installation guidance, before progressing to core R fundamentals such as data structures and manipulation. Building on these fundamentals, the subsequent modules focus on practical applications, exploring table generation, function customization, R Markdown for reproducible reporting, and advancing skills in data visualization with ggplot2 and statistical modeling. In the final module, learners consolidate their training by applying key concepts to real-world scenarios, thereby acquiring practical skills vital for clinical research. Several lessons were identified from this training program, providing a strategic foundation for practical open-source training initiatives in a work environment. One key insight was the challenge of securing in-person attendance due to the predominantly remote workforce. To overcome this, hybrid sessions consisting of online and in-person workshops with interactive exercises were implemented. During exercises, participants were divided into small breakout rooms, each led by an R expert, allowing for personalized support and creating a more engaging environment. Another key takeaway was the difficulty participants faced in understanding the distinction between R and RStudio, particularly when it came to the complexities of package management, which emerged as one of the more challenging aspects of the introductory session. In retrospect, providing overview materials beforehand could have better prepared attendees for these challenging concepts. To ensure accessibility, all sessions were recorded, enabling participants to revisit the content if they were unable to follow the live session or wished to refer back in the future. Lastly, to sustain engagement and foster continuous improvement, post-session feedback was systematically collected and used to refine future modules. Additionally, training materials were shared in advance to ensure participants were set up for success. Sustained support is vital for successful open-source adoption. We aim to foster continuous growth and drive meaningful contributions to the R community by offering Open-Source office hours, incorporating R proficiency into annual performance goals, and planning advanced training on package management, development, and container-based reproducibility.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Anna Giczewska; Brooke Alhanti, DaJuanicia Holmes (Duke Clinical Research Institute), Ronald Kamusiime (Duke Clinical Research Institute), Miloni Shah (Duke Clinical Research Institute) Keyword(s): open-source adoption, r training program, clinical research, reproducibility, continuous learning Video recording available after conference: ✅ |
Anna Giczewska |
| 10:30–12:00 | Penn 1 | Rediscovering R for Library Data Instruction with Google ColabMore infoAs the first data literacies lead at my university, I support users with a wide range of skills and preferences, from those working with spreadsheets to those comfortable writing code. While I have used R extensively in my research, I found that most users prefer Python for coding, so I adapted my workshops and services to meet that demand. However, when a faculty member recently requested an R data visualization workshop for a data analysis course, I welcomed the opportunity to revisit R in a meaningful way. In this lightning talk, I will share insights on balancing comprehensive data services with the limitations of time and resources. I will also highlight how Google Colab’s built-in R support has transformed the delivery of hands-on sessions by removing the necessity for software installations or IT approvals. This is especially significant at our Google-centric campus, where Colab integration is seamless and readily accessible to everyone. I hope to encourage discussion on strategies for integrating R into academic settings, particularly in environments where software access and institutional preferences for other tools present challenges.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Ahmad Pratama (Stony Brook University) Keyword(s): library data services, data visualization, data instruction, google colab, barriers to r adoption Video recording available after conference: ✅ |
Ahmad Pratama (Stony Brook University) |
| 10:30–12:00 | Penn 1 | Storytelling with ggplot2: Using custom functions to sequence visualizationsMore infoCustom functions using {ggplot2} can decrease repetition in the exploratory phase of an analysis, but they can also be incredibly useful for highlighting and telling stories when communicating final results. This session features a visualization sequence demonstrating how functions can break {ggplot2} figures into pieces for ease of interpretation in a presentation setting. Audience members will come away from the session with a broadened view of how transparency levels and color can be used as function arguments to reveal sections of visuals piece by piece.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): McCall Pitcher (Duke University) Keyword(s): data visualization, functions, ggplot2, data storytelling Video recording available after conference: ✅ |
McCall Pitcher (Duke University) |
| 10:30–12:00 | Penn 1 | The Art of the Question: Building an Effective R Consultation Program in an Academic LibraryMore infoIn data science consultations, the path to success often begins not with providing immediate technical solutions, but with asking the right questions. This presentation explores how the Data Science Consulting Program at North Carolina State University has developed a question-based framework for R programming consultations that transforms how researchers approach data analysis problems. We'll share our structured consultation methodology that helps patrons clarify research objectives, challenge underlying assumptions, and refine analytical approaches before writing a single line of code. Through case studies spanning multiple academic disciplines, we'll demonstrate how strategic questioning fueled by critical thinking has led researchers to revise their analytical strategies, discover more elegant solutions, and sometimes completely reframe their research questions—ultimately producing more robust and meaningful results. This talk will provide practical insights for anyone supporting R users across skill levels, including a typology of questions that promote deeper analytical thinking, strategies for training consultation staff in this approach, and assessment methods to measure success. Whether you're supporting colleagues in industry, mentoring students, teaching workshops, or building a consultation program, you'll learn skills to identify someone's real versus stated need and to communicate technical information back effectively. Join us to explore how the art of asking the right questions can transform technical consultations and position R experts as valuable contributors to the entire research process.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Abhinandra Singh (North Carolina State University); Selene Schmittling (North Carolina State University), Alp Tezbasaran (North Carolina State University), Shannon Ricci (North Carolina State University), Mara Blake (North Carolina State University) Keyword(s): consulting, research, critical thinking Video recording available after conference: ✅ |
Abhinandra Singh (North Carolina State University) |
| 10:30–12:00 | Penn 1 | gfwr, an R package to access data from the Global Fishing Watch APIsMore infoAt Global Fishing Watch, we create and publicly share knowledge about human activity at sea to enable fair and sustainable use of our ocean. By processing terabytes of global vessel position data transmitted via the Automatic Identification System (AIS) and applying machine learning models, we create the most comprehensive view of vessel activities around the world. Our datasets are open to everyone and our aim is to facilitate access to them for the general scientific community. As part of this goal, we created gfwr, an R package that communicates with our public APIs and retrieves our data through three main functions: - get_vessel_info() provides access to vessel information from hundreds of thousands of fishing and non-fishing vessels from all over the world, getting identity information based on AIS self-reported data, public registries and authorizations - get_event() retrieves event information calculated by our algorithms. This includes encounters at sea, loitering, port visits and fishing events by vessel. - get_raster() returns fishing effort based on AIS data. In the package documentation, we created vignettes that show how to concatenate these functions into comprehensive workflows that can be adapted depending on the researcher's needs. By offering access through R, we also contribute to more open, transparent and reproducible science. Since 2022, more than 400 users have used gfwr, mostly from institutions in North America and Europe. We are now conducting multilingual training workshops to expand this user base and help promote a culture of transparency in ocean governance.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Andrea Sánchez-Tapia (Global Fishing Watch) Keyword(s): apis, fisheries science, public data, r package, machine learning Video recording available after conference: ✅ |
Andrea Sánchez-Tapia (Global Fishing Watch) |
| 10:30–12:00 | Penn 1 | pretestcad: An R package to calculate PreTest Probability (PTP) scores for obstructive Coronary Artery Disease (CAD)More infoMost clinicians in cardiology use an online portal such as HeartScore to calculate a risk score for a patient. However, as risk scores continue to evolve and update, it can be a tedious process to recalculate the risk score of many patients, as these online portals can only do so one patient at a time. As such, there has been a rise of R packages, such as CVrisk, RiskScorescvd and whoishRisk, dedicated to calculating a patient's risk of cardiovascular disease in an automated way. Despite the progress made, pretest risk scores for obstructive CAD are lacking. Hence an R package called pretestcad was made to fill this gap, allowing users to calculate these scores automatically for many patients. Examples of such scores are the 2012 CAD Consortium 2 (CAD2) PTP scores, the 2017 PROMISE Minimal-Risk Score, and the 2020 Winther et al. Risk-Factor-weighted Clinical Likelihood (RF-CL) and Coronary Artery Calcium Score-Weighted Clinical Likelihood (CACS-CL) PTP, which were recommended for use in the 2024 ESC Guidelines. I hope that presenting this R package at this conference will not only raise awareness of the package in the medical field but also foster collaboration to make the R package more accessible and user friendly, and expand my knowledge of other pretest scores.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Jeremy Selva (National Heart Centre Singapore) Keyword(s): pretest probability, risk scores, r package, clinical/medical research Video recording available after conference: ✅ |
Jeremy Selva (National Heart Centre Singapore) |
| 10:30–12:00 | Penn 1 | propertee: Flexible Covariance Adjustment and Improved Standard Errors in Analyses of Intact ClustersMore infoIn studies with intact clusters, regressing outcomes on an indicator for treatment assignment along with covariates commonly provides an estimate of the intent-to-treat (ITT) effect, with robust sandwich standard errors addressing clustering and possibly heteroskedasticity. Even when treatment assignment is ignorable under the study design, the parametric structure necessary for consistent effect estimation limits the gains in precision one seeks by including covariates. In contrast, differencing estimators take the difference between treated and control individuals in their average difference between outcome and some proxy for confounding effects. The propertee package offers users the opportunity to use a prediction of the outcome under control from a flexible “first-stage” model fit as this proxy. The Neyman variance estimator, the default standard error for difference-in-means estimates, fails to account for sampling variability from the model predictions when applied to this ITT effect estimator. The propertee package addresses this issue by augmenting a novel cluster-robust jackknife estimate of the sampling variability of the difference-in-means with a heteroskedasticity-robust estimate of the variability of the model coefficient estimates. This standard error provides asymptotically valid inference for the ITT effect when the first-stage model dimension grows sufficiently slowly with the size of the fitting sample, while entirely removing downward bias in finite samples, even when the model dimension grows more quickly than the rate necessary for asymptotic unbiasedness.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Joshua Wasserman (University of Michigan - Ann Arbor); Ben Hansen (University of Michigan - Ann Arbor) Keyword(s): cluster-randomized trials, clustered observational studies, cluster-robust standard errors, causal inference, intent-to-treat effect Video recording available after conference: ✅ |
Joshua Wasserman (University of Michigan - Ann Arbor) |
| 10:30–12:00 | Penn 1 | tinytable: A lightweight package to create simple and configurable tables in a wide variety of formatsMore infoThe R ecosystem offers a wide range of packages for generating tables in various formats. However, existing solutions often suffer from complexity, excessive dependencies, or rigid formatting systems. In response to these challenges, we introduce tinytable, a lightweight yet powerful R package for producing high-quality tables in multiple formats, including HTML, LaTeX, Word, PDF, PNG, Markdown, and Typst. tinytable is designed with a minimalist and intuitive interface while providing extensive customization options. Unlike many existing table-drawing packages, tinytable adheres to a strict design philosophy centered on three principles: separation of data and style, flexibility, and lightweight implementation. First, tinytable ensures that table content remains distinct from formatting instructions, enabling users to generate clean, human-readable code that is easier to edit and debug. Second, the package leverages well-established frameworks such as Bootstrap for HTML and tabularray for LaTeX, providing robust and highly customizable styling capabilities. Third, tinytable prioritizes a lightweight implementation by importing zero third-party R packages by default, reducing computational overhead and improving maintainability. The package was developed to address key limitations observed in existing table-drawing tools, specifically aiming for a simple, flexible, concise, and safe user experience. tinytable provides a streamlined API with a minimal learning curve, allowing users to generate high-quality tables with less code while ensuring strong input validation and informative error messages. Its small and maintainable codebase avoids excessive reliance on regular expressions, making it both efficient and transparent. This presentation will showcase tinytable’s capabilities through applied examples, demonstrating how users can create aesthetically pleasing tables with minimal effort while retaining complete control over their formatting. By offering a zero-dependency, highly customizable, and human-readable approach to table generation, tinytable represents a valuable addition to the R ecosystem for data analysis and reporting.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Vincent Arel-Bundock (Université de Montréal) Keyword(s): table formatting, latex, html, markdown, reproducible research Video recording available after conference: ✅ |
Vincent Arel-Bundock (Université de Montréal) |
| Modeling 2 | |||
| 10:30–12:00 | Penn 2 | Bayesian Variable Selection and Model Averaging in R using BASMore info[Placeholder text]Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Merlise Clyde (Duke University) Keyword(s): nan Video recording available after conference: ✅ |
Merlise Clyde (Duke University) |
| 10:30–12:00 | Penn 2 | Integrating R Models into Automated Machine Learning Pipelines with SASMore infoMany data scientists rely on automated pipelines to streamline predictive modeling and machine learning workflows. But did you know how easy it is to incorporate R models into these pipelines using SAS software? Whether you're working in a mixed-language environment or want to enhance automation, integrating R with SAS Model Studio allows you to seamlessly compare, select, and deploy models within a structured framework. Join us to explore how open-source models can fit into an end-to-end machine learning process, enabling efficiency, reproducibility, and scalability.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Rachel McLawhon Keyword(s): multi-language integration, model pipelines Video recording available after conference: ✅ |
Rachel McLawhon |
| 10:30–12:00 | Penn 2 | Multimedia: An R package for multimodal mediation analysisMore infoMediation analysis has emerged as a versatile tool for answering mechanistic questions in microbiome research because it provides a statistical framework for attributing treatment effects to alternative causal pathways. Using a series of linked regressions, this analysis quantifies how complementary data relate to one another and respond to treatments. Despite these advances, existing software’s rigid assumptions often result in users viewing mediation analysis as a black box. We designed the multimedia R package to make advanced mediation analysis techniques accessible, ensuring that statistical components are interpretable and adaptable. The package provides a uniform interface to direct and indirect effect estimation, synthetic null hypothesis testing, bootstrap confidence interval construction, and sensitivity analysis, enabling experimentation with various mediator and outcome models while maintaining a simple overall workflow. The software includes modules for regularized linear, compositional, random forest, hierarchical, and hurdle modeling, making it well-suited to microbiome data. We illustrate the package through two case studies. The first re-analyzes a study of Inflammatory Bowel Disease patients, uncovering potential mechanistic interactions between the microbiome and disease-associated metabolites, not found in the original study. The second analyzes new data about the influence of mindfulness practice on the microbiome. A gallery of examples and further documentation can be found at https://go.wisc.edu/830110.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Kris Sankaran Keyword(s): causal inference, microbiome, biostatistics, r package, modularity Video recording available after conference: ✅ |
Kris Sankaran |
| 10:30–12:00 | Penn 2 | distfreereg: A New R Package for Distribution-Free Parametric Regression TestingMore infoGoodness-of-fit testing is a crucial step in verifying the reliability of inferences drawn from a parametric regression model. It helps avoid invalid conclusions based on false assumptions. For example, the p-value associated with a coefficient in a linear model is unreliable if the mean function being used does not agree sufficiently with the data. Until now, there has been no easy and reliable way in R to test formally whether or not the mean function of a parametric regression model agrees with the data. In my presentation, I shall discuss my new R package, distfreereg, that implements the distribution-free goodness-of-fit testing procedure for parametric regression models introduced by Estate Khmaladze in 2021. I shall outline Khmaladze's algorithm, discuss the main features of the package, and illustrate its use with examples.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Jesse Miller (University of Minnesota) Keyword(s): goodness-of-fit testing, parametric regression, regression modeling, empirical partial sum process, r package Video recording available after conference: ❌ |
Jesse Miller (University of Minnesota) |
| R in organizations | |||
| 10:30–12:00 | Penn Garden | Implementing Posit® Team: Lessons LearnedMore infoWe are Smithfield Premium Genetics (SPG), a small department in Smithfield Foods that analyzes and reports on a wide variety of data. Our primary tools for reporting and visualization since 2012 have been open source versions of R, RStudio Server and Shiny Server. In the summer of 2022, we committed to installing Posit® Team in order to simplify package management, collaborative development and report publication. Initially, SPG expected all components of Posit® Team to be operational within six months. The actual time was closer to two and a half years. We struggled with strict corporate IT protocols, missed deadlines, service provider selections and legal issues. In hindsight, six months was an unreasonable expectation, but the actual waiting period could have been shortened dramatically had we prepared properly. The purpose of this presentation is to shed light on some of SPG’s roadblocks to implementation and possibly help other small data science groups implement Posit® Team.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Lowell Gould (Smithfield Foods) Keyword(s): posit team, collaboration, workflow Video recording available after conference: ✅ |
Lowell Gould (Smithfield Foods) |
| 10:30–12:00 | Penn Garden | Navigating the Transition: Lessons from Adopting an Open-Source ApproachMore infoMany organizations are looking to switch to open-source software for its cost-effectiveness, greater flexibility, and better long-term maintainability compared to other software. For our project, this transition was driven by the need for a more cost-effective software tool and to reduce dependency on proprietary software services. However, this transition can present challenges such as training for team members, balancing training with other work commitments, and quality assurance for new programs. We describe the transition to R and RStudio over a 3-month period on a public health project with a team that had varying levels of R programming experience. Included in this discussion is our approach to training and implementation, and results for the project transition as we convert programs focused on data quality checks, data manipulation and processing, and statistical estimation. We touch on lessons learned on how to train new programmers to use R, what worked and did not work in terms of converting existing code, and resource planning needed to further continue this transition across other projects.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Carlos Petzold (RTI International); Adam Lee (RTI International) Keyword(s): training, implementation, public health, open-source, transition Video recording available after conference: ✅ |
Carlos Petzold (RTI International) |
| 10:30–12:00 | Penn Garden | R & Python play nice, in productionMore infoHi, I’m Claudia Peñaloza, a Data Scientist at Continental Tires, where going data-driven can be an adventure. What started as a proof of concept a few years ago, evolved into Conti’s first-ever Predictive Machine Learning Model for R&D! A wedding, two babies, three lateral moves, and four hires later, our team had also evolved… from mostly R to mostly Python developers. Rewriting 1000+ commits? No thanks. Instead, we got R and Python to play nice. With Renv, Poetry, and Docker, we keep things reproducible, portable, and deployable on various ML-Ops platforms. The takeaway? With the right tools, teams can mix and match languages, leveraging the best in each, and still build solid, scalable solutions.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Claudia Penaloza (Continental Tires) Keyword(s): multi-lingual, r, python, docker, mlops Video recording available after conference: ✅ |
Claudia Penaloza (Continental Tires) |
| 10:30–12:00 | Penn Garden | Standardizing Institutional Research Operations Using RMore infoIn this presentation, we will demonstrate how a small Institutional Research (IR) team can leverage R and Quarto to consolidate and streamline its analytics platforms and reporting tools. We will highlight the versatility of R and Quarto in institutional research by showcasing their use in producing operational manuals, presentation slide decks, BI dashboards, internal and external reports, and institution-wide parameterized reports.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Chris Kao (Flagler College) Keyword(s): institutional research Video recording available after conference: ✅ |
Chris Kao (Flagler College) |
| Shiny | |||
| 10:30–12:00 | Gross 270 | AI Execution Capability Assessment: A Shiny Web App for AI Strategy and GovernanceMore infoAs organizations adopt AI, assessing readiness, maturity, and governance is critical. This talk introduces a Shiny-based AI capability assessment tool that integrates R, Python (via reticulate), FAISS, and OpenAI APIs to evaluate AI execution across 15 key areas. The app enables organizations to: Conduct structured self-assessments on AI capabilities. Use FAISS-enhanced retrieval-augmented generation (RAG) to retrieve insights from NIST AI RMF and ISO/IEC 42001. Generate customized LLM-based recommendations, gap analysis reports, and automated project proposals. Visualize AI maturity through interactive radar charts and data-driven priority rankings. Built with Shiny, shinydashboard, DT, plotly, and OpenAI, this tool streamlines AI governance, helping stakeholders align strategy with regulatory and operational needs. This talk will showcase the app’s development, challenges in integrating R with LLMs, and its real-world impact on AI strategy execution.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Akbar Akbari Esfahani (Central California Alliance for Health) Keyword(s): shiny, python, ai governance, generativeai, rag (retrieval-augmented generation). Video recording available after conference: ✅ |
Akbar Akbari Esfahani (Central California Alliance for Health) |
| 10:30–12:00 | Gross 270 | Every Eclipse Visible from Your Current Location for the Rest of Your Life (with shiny)More infoA shiny web app that calculates all solar and lunar eclipses for up to the next 75 years visible at your current lon/lat coordinate location ([link][1]). The shiny app leverages the swephR High Precision Swiss Ephemeris package for celestial body calculations. Anecdote: I witnessed the sky darken for no apparent reason in the fall of 2023. Months later I realized I was witnessing the October 2023 annular solar eclipse. So I began to wonder: How many other eclipse events have I been missing? Broad topics covered: Julius Caesar vs Pope Gregory XIII: The Battle for Space-Time. While most of the western world adopted the Gregorian calendar (365.2425 days long) by the 20th century for agricultural and cultural reasons, astronomers track time off-earth using the older Julian calendar (365.25 days long). We will briefly touch on why this is, how to convert back-and-forth between Gregorian and Julian to perform astronomical calculations, and other notable phenomena to keep in mind when using R in space. User input: flexibility vs ease-of-use. Specifically with regards to earthly longitude/latitude coordinate inputs (required for calculating the alignment of celestial bodies to an Earth-bound viewer), a decision had to be made affecting usability of the shiny app for the end-user based on their ability to easily enter an input location. The breadth of the swephR package: The swephR package is useful for calculating not only solar and lunar eclipse events visible from earth but all kinds of celestial alignments, including planetary positions, the crossing of planets over positions, fixed star positions, and orbital periods for the Earth, asteroids, etc. Biography: Tim Bender, Hobbyist. Bachelor of Urban Planning from the University of Cincinnati. ~15 years local government experience as an urban planner and transit planner. Tim was part of a team that helped deploy Google Transit for his transit agency in Kentucky in 2008, being among the first 50-ish agencies worldwide to go live. His journey with R began with a desire to log transit vehicle real-time location data from an API for analysis, but with no programming experience or knowledge of how to approach the problem, he wouldn’t successfully solve it until after about 5 years of self-guided learning. [linkedin][2] [github][3] [1]: https://tim-bender.shinyapps.io/shiny_all_eclipses/ [2]: https://www.linkedin.com/in/tim-bender-238870171/ [3]: https://github.com/benda18Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Timothy Bender Keyword(s): shiny, astronomy, leaflet, geocoding, communication Video recording available after conference: ✅ |
Timothy Bender |
| 10:30–12:00 | Gross 270 | Extending Shiny Dashboards to Mobile with Ionic and Rust: A Cross-Platform ApproachMore infoShiny has long been a framework of choice for interactive dashboards on the web, but what if you need a mobile variant? This talk proposes new ways to extend existing Shiny dashboards to mobile dashboards by using Ionic and JavaScript, while making use of a Rust server for data manipulation and preparation. We'll analyze the integration of Shiny with a mobile frontend, the benefits offered by Rust in terms of backend efficiency, and the tradeoffs between web-based and mobile dashboard interfaces. Finally, we'll assemble everything with a live working proof-of-concept mobile application that streams a Shiny dashboard into a simplified and responsive user interface.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): Anastasiia Kostiv Keyword(s): ionic shiny rust react.js Video recording available after conference: ✅ |
Anastasiia Kostiv |
| 10:30–12:00 | Gross 270 | paleopal: a highly modular and interactive Shiny app for building reproducible data science workflows in paleontologyMore infoThe field of computational paleontology is rapidly advancing with many recently developed open-source R packages leading the charge for more standardized, reproducible, and open research. This push is a relief for many data-science-minded paleontologists who have previously toiled over writing their own scripts to download, clean, analyze, and visualize their data. Many of these steps are now covered by functions in these new packages (and those of other packages in the R universe). However, this push for more script-based research may throw a wrench into the existing scientific workflows of less technical researchers who lack a background in coding or cause a greater learning curve for new researchers introduced to the field. Therefore, bridging the gap between visual, hands-on workflows and digital, code-based workflows is imperative to the collaborative future of computational paleontology. Here I present a new Shiny app, paleopal, that provides a user-friendly interface to build paleontological data science workflows. The app connects existing paleontological R packages such as palaeoverse and deeptime with the tidyverse suite of R packages to encourage standardized scientific pipelines. Furthermore, the app uses the shinymeta R package to provide a live code and results panel and a downloadable RMarkdown script for the pipeline. Altogether, paleopal aims to spearhead the next generation of training of computational paleontologists, regardless of age, background, or technical expertise. Further, the modular nature of the app introduces an avenue for other fields to fork and adapt the project for their own needs.Date and time: Sun, Aug 10, 2025 - 10:30–12:00 Author(s): William Gearty (Syracuse University) Keyword(s): shiny, workflow, earth sciences, biological sciences, reproducible research Video recording available after conference: ✅ |
William Gearty (Syracuse University) |
| Quarto | |||
| 13:00–14:30 | Penn 1 | Final FlourishesMore infoQuarto’s clean and simple API is one of its greatest strengths. This ease of use extends to creating Quarto extensions as well. In this talk, we will document the details and process we took to create our Quarto extension, Flourish. Flourish allows users to dynamically target text in code chunks, like functions or parameters, and apply styling (e.g. highlighting) in the rendered document. If you are familiar with the R package Flair, Flourish works similarly, but is language agnostic. We will speak in depth about how Flourish dynamically injects styling into a rendered report using Javascript, and the work it took to get to this process. This talk seeks to inform participants of our extension and its technical workings, as well as step them through our process of creating a Quarto extension, showing how simple it can be to extend Quarto’s functionality.Date and time: Sun, Aug 10, 2025 - 13:00–14:30 Author(s): Visruth Srimath Kandali (Cal Poly -- San Luis Obispo); Kelly Bodwin (California Polytechnic State University) Keyword(s): quarto, extension development, pedagogy, formatting, technical Video recording available after conference: ✅ |
Visruth Srimath Kandali (Cal Poly -- San Luis Obispo) |
| 13:00–14:30 | Penn 1 | From Frustration to Function: Tackling the Challenges of New Tech Adoption "Cracking the Code: Overcoming Early Hurdles with New Open-Source Tools"More infoAdopting new open-source technology can be both exciting and challenging. While a tool may appear promising and seem like the perfect fit for a specific task, early-stage technologies often come with their own set of hurdles. One of the biggest challenges is the lack of comprehensive resources—such as detailed documentation, practical examples, active discussion boards, and community support—which can make it difficult to troubleshoot issues or fully understand the tool’s capabilities. This often requires additional effort, experimentation, and problem-solving to get things working as intended. At last year’s Posit Conference, I was introduced to a new tool called closeread, a Quarto extension designed for vertical scrollytelling. The concept immediately caught my interest because it seemed like an innovative way to enhance storytelling with data. Motivated by its potential, I decided to give it a try shortly after the conference. However, my initial experience was challenging. The tool was only partially functional, and when I encountered technical issues, I struggled to find enough supporting resources to resolve them. The available documentation was limited, examples were scarce, and there wasn’t much discussion happening in community forums. Frustrated by these obstacles, I eventually set the tool aside, unsure of how to move forward. Some time later, I came across a user-contribution contest that reignited my interest in closeread. This contest motivated me to tackle the tool again, but this time with a different mindset. Instead of relying solely on available resources, I approached the problem more systematically—digging into the code, experimenting with different configurations, and learning through trial and error. This hands-on approach, combined with the fresh motivation from the contest, helped me overcome the technical challenges I had faced earlier. Eventually, I was able to get the tool working successfully, and in the process, I gained a deeper understanding of how to navigate the common pitfalls associated with adopting new technology. In my talk, I will share this learning journey in detail, highlighting the strategies that helped me move from frustration to success. I’ll discuss practical approaches for overcoming deployment challenges, including how to troubleshoot effectively when documentation is limited, how to leverage community resources even when they seem sparse, and how to maintain motivation when progress feels slow. I’ll also offer insights into the mindset shifts that can make a big difference—like viewing challenges as opportunities to deepen your technical skills rather than as roadblocks.Date and time: Sun, Aug 10, 2025 - 13:00–14:30 Author(s): Dror Berel Keyword(s): quarto, closeread, best practices, community, scrollytelling Video recording available after conference: ✅ |
Dror Berel |
| 13:00–14:30 | Penn 1 | Parsing Quarto and R Markdown documents in RMore infoIn this talk, we will share recent work on the parsermd R package and its use for the programmatic manipulation of Quarto and R Markdown documents. The talk will include a brief overview of the underlying technical details of the parser and abstract syntax tree (AST) representation of these documents in R. Additionally, we will present work to support a number of use cases for these tools to solve practical problems. Examples include building documents that have multiple output variants (e.g. assignments with or without solutions included) and utilities built to support grading and feedback for assignments based on these document formats. Finally, we will discuss our future development plans for the package and related tools.Date and time: Sun, Aug 10, 2025 - 13:00–14:30 Author(s): Colin Rundel (Duke University) Keyword(s): quarto, rmarkdown, literate programming, automation Video recording available after conference: ✅ |
Colin Rundel (Duke University) |
| 13:00–14:30 | Penn 1 | Reproducible pedagogy with R and QuartoMore infoReproducible teaching materials (1) help students understand the usefulness of reproducibility, (2) facilitate pedagogy around meaningful, data-driven statistical applications, and (3) allow other instructors to iterate on course materials. In this talk, I show how a variety of R packages and Quarto work together to produce useful, simple and aesthetic reproducible course materials. I discuss how to weave R-code into lectures, embed reproducible visualizations and animations, and easily host data sets as well as downloadable R scripts on a public course website developed with Quarto. I showcase an undergraduate Bayesian statistics course taught at Duke University in Spring 2025 as an example of reproducible statistics pedagogy and provide tutorial-like instructions to adapt the methods to other courses.Date and time: Sun, Aug 10, 2025 - 13:00–14:30 Author(s): Alexander Fisher (Duke University) Keyword(s): teaching, statistics, website, reproducible Video recording available after conference: ✅ |
Alexander Fisher (Duke University) |
| Productivity boosters | |||
| 13:00–14:30 | Penn 2 | Air - A blazingly fast R code formatterMore infoIn Python, Rust, Go, and many other languages, code formatters are widely loved. They run on every save, on every pull request, and in git pre-commit hooks to ensure code consistently looks its best at all times. In this talk, you'll learn about Air, a new R code formatter. Air is extremely fast, capable of formatting individual files so fast that you'll question if it's even running, and of formatting entire projects in under a second. Air integrates directly with your favorite IDEs, like Positron, RStudio, and VS Code, and is available on the command line, making it easy to standardize on one tool even for teams using various IDEs. Once you start using Air, you'll never worry about code style ever again!Date and time: Sun, Aug 10, 2025 - 13:00–14:30 Author(s): Lionel Henry (Posit PBC), Davis Vaughan (Posit PBC) Keyword(s): formatter, rust Video recording available after conference: ✅ |
Lionel Henry (Posit PBC) Davis Vaughan (Posit PBC) |
| 13:00–14:30 | Penn 2 | Getting Things LoggedMore infologger is a lightweight, modern, and flexible logging utility for R, with a clear concept and separation of log message formatter, layout renderer, and log record appender functions -- which makes it effortless to log messages in various formats and destinations, such as your console, files, databases, or even Slack. The package was first released 6 years ago, and it has been widely used since then. Development recently spiked thanks to generous contributions from the community (over 100 pull requests), introducing an improved async logger; new formatter, helper, and appender functions; documentation updates; and a hex logo!Date and time: Sun, Aug 10, 2025 - 13:00–14:30 Author(s): Gergely Daroczi Keyword(s): logging Video recording available after conference: ✅ |
Gergely Daroczi |
| 13:00–14:30 | Penn 2 | R4R: Reproducibility for RMore infoCreating a reproducible environment for data analysis pipelines is challenging, due to the wide range of dependencies involved—from data inputs and external tools to system libraries and R packages. Although various tools exist to simplify the process, they often focus exclusively on R package dependencies and omit the system ones, rely on user-supplied metadata, or create an unnecessarily large environment. We present r4r, a tool that automatically traces all dependencies in a pipeline using system call interception. Based on these traces, r4r generates a Docker image containing precisely the dependencies needed for reproducible execution. We demonstrate its effectiveness on a collection of R Markdown notebooks from Kaggle, illustrating how r4r helps ensure fully reproducible workflows.Date and time: Sun, Aug 10, 2025 - 13:00–14:30 Author(s): Pierre Donat-Bouillud (Czech Technical University), Filip Křikava (Czech Technical University) Keyword(s): reproducibility, docker, automation Video recording available after conference: ✅ |
Pierre Donat-Bouillud (Czech Technical University) Filip Křikava (Czech Technical University) |
| 13:00–14:30 | Penn 2 | wizrd: Programming with LLMs in RMore infoLarge Language Models (LLMs) offer new opportunities for accelerating data analysis, providing flexibility through a natural language interface. The wizrd package was born to test the hypothesis that LLMs can be used to implement functions within larger data science tools and workflows. With wizrd, users can parameterize model inputs, implement logic that delegates to R functions, and constrain outputs to specific R data structures. Finally, configured models can be converted into actual R functions, which can in turn serve as tools for other models, forming a graph of agents. LLM-based programs can be imported from the Langsmith hub, realizing their portability across languages. We will give an overview and demo of the package.Date and time: Sun, Aug 10, 2025 - 13:00–14:30 Author(s): Michael Lawrence Keyword(s): llm, ai, programming Video recording available after conference: ✅ |
Michael Lawrence |
| Too big to fail | |||
| 13:00–14:30 | Penn Garden | Futureverse P2P: Peer-to-Peer Parallelization in R using FutureverseMore infoTL;DR: In this presentation, I will show how you can move from running your Futureverse R code in parallel on your local computer to a distributed peer-to-peer (P2P) network of computers shared among friends - all with a single change of settings. Any user with R installed can contribute their compute power when idle and harness others when needed. Abstract: The Futureverse framework revolutionized parallel computing in R by providing a simple, unified API for parallel evaluation of R code. At its core, the future package allows developers to write code once (e.g. f <- future(lm(x, y))), which then may be evaluated on any future-compatible parallel backend. For instance, plan(multisession) parallelizes on the local machine, plan(cluster) and plan(mirai_cluster) can parallelize on a set of local or remote machines, and plan(batchtools_slurm) distributes the computations on a Slurm high-performance compute (HPC) cluster, and so on. Regardless of backend used, getting the value of the future is always the same (e.g. v <- value(f)). (A minimal, hedged sketch of this write-once pattern appears after the schedule table below.) In this presentation, I introduce a novel peer-to-peer (P2P) future backend that enables distributed parallel computing among friends and colleagues using shared storage. The shared storage can be a local file system or a cloud storage. I will illustrate this concept using the plan(p2p_gdrive) backend, which leverages Google Drive as a communication medium, where users can offload computational tasks to peers in a shared workspace. When a user creates a future, it ends up in the “todo” folder, where idle peers can detect it, download it, and execute it locally. Once completed, the result is uploaded to the “done” folder, making it available for retrieval by the original user. The Futureverse ecosystem has a simple API, which makes it easy for anyone to write parallel R code. Because Futureverse is exceedingly well-tested, you can easily and safely scale up code currently running on your local computer to run on distributed P2P clusters. This approach democratizes distributed computing, allowing R users to harness the collective power of their social network without requiring dedicated HPC infrastructure. Come to my talk and learn how you and your friends can get together and share your compute resources, allowing you to run y <- future_lapply(X, my_long_running_analysis) across their computers, while you share your idle compute resources with them. All this is available from your local R prompt.Date and time: Sun, Aug 10, 2025 - 13:00–14:30 Author(s): Henrik Bengtsson (University of California San Francisco (UCSF)) Keyword(s): programming, parallel processing, performance, reproducibility Video recording available after conference: ✅ |
Henrik Bengtsson (University of California San Francisco (UCSF)) |
| 13:00–14:30 | Penn Garden | Outgrowing your laptop with PositronMore infoHave you ever run out of memory or time when tidying data, making a visualization, or training a model? An R user may find their laptop more than sufficient to start their journey with statistical computing, but as datasets grow in size and complexity, so does the necessity for more sophisticated tooling. This talk will step through a set of approaches to scale your tasks beyond in-memory analysis on your local machine, using the Positron IDE: adopting a lazy evaluation engine like DuckDB, connecting to remote databases with fluent workflows, and even migrating from desktop analysis entirely to server or cloud compute using SSH tunnelling. The transition away from a local, in-memory programming paradigm can be challenging for R users, who may not have much exposure to tools or training for these ways of working. This talk will explore available options which make crossing this boundary more approachable, and how they can be used with an advanced development environment suited for statistical computing with R. Integrations in the Positron IDE make all these tasks easier; for example, remote development in Positron allows an R user to seamlessly write code on their local machine, and execute that code on a remote host without tedious interactions outside the IDE. Whether you train statistical models, build interactive apps, or work with large datasets, after this talk you’ll walk away with techniques for doing it better with Positron.Date and time: Sun, Aug 10, 2025 - 13:00–14:30 Author(s): Julia Silge (Posit PBC) Keyword(s): ide, workflow, tooling, remote development Video recording available after conference: ✅ |
Julia Silge (Posit PBC) |
| 13:00–14:30 | Penn Garden | Scaling Up Data Workflows with Arrow, Parquet, and DuckDBMore infoWhile R is an expressive language for exploring and manipulating data, it is not naturally suited to working with datasets that are larger than can fit into memory. However, modern tooling, including Parquet files, the Arrow format, and query engines like DuckDB, can expand what is possible to do with large datasets in R. Using practical examples, this talk will introduce several packages that bring these tools into R with intuitive interfaces, and demonstrate how to adopt them to work efficiently with large datasets. It will also show how these tools unlock new opportunities, such as easy access to data in cloud storage, and will explore recent developments in the Arrow and Parquet ecosystems, including for geospatial data.Date and time: Sun, Aug 10, 2025 - 13:00–14:30 Author(s): Neal Richardson (Posit PBC) Keyword(s): arrow, parquet, duckdb, big data Video recording available after conference: ✅ |
Neal Richardson (Posit PBC) |
| 13:00–14:30 | Penn Garden | sparsity support in tidymodels, faster and less memory hungry modelsMore infoSparse data, data with a lot of 0s, appear quite often in modeling contexts. However, existing data structures such as data.frames or matrices don't have a good way of handling them: you were forced to represent all data as either sparse or dense (non-sparse). This means that many modeling workflows use a non-optimal data structure, which at best slows down computation and at worst isn't computationally feasible. This talk will cover how we overcame these issues in tidymodels, starting with the creation of a sparse vector format that fits in tibbles, followed by the wiring needed to make it happen in our packages. The best part is that most users don't need to change anything in their code to benefit from these speed improvements.Date and time: Sun, Aug 10, 2025 - 13:00–14:30 Author(s): Emil Hvitfeldt (Posit PBC) Keyword(s): machine learning, tidymodels, sparse data Video recording available after conference: ✅ |
Emil Hvitfeldt (Posit PBC) |
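
The Futureverse P2P abstract above describes a write-once pattern in which the same code runs on whichever backend is selected with plan(). As a rough illustration only (not the speaker's experimental P2P backend, which is not assumed to be installable), the sketch below uses the CRAN future and future.apply packages with a local multisession backend:

```r
# Minimal sketch of the Futureverse pattern described in the abstract above.
# Assumes only the CRAN packages 'future' and 'future.apply'; the experimental
# plan(p2p_gdrive) backend mentioned in the talk is NOT used here.
library(future)
library(future.apply)

# Pick a backend once; the analysis code below stays the same regardless.
plan(multisession, workers = 2)   # parallelize on the local machine

# Create a single future, keep working, then collect its value.
f <- future({
  summary(lm(mpg ~ wt, data = mtcars))
})
v <- value(f)

# Map-style parallelism over a list of inputs.
results <- future_lapply(1:4, function(i) mean(rnorm(1e5, mean = i)))

# Switch back to sequential evaluation without touching the analysis code.
plan(sequential)
```

Swapping the plan() call for another backend (for example, a cluster of remote machines) is the only change needed to redistribute the same futures, which is the property the proposed P2P backend builds on.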
Footnotes
Source code for the R version at https://gitlab.com/rconf/user-2025-website.↩︎