Computational Resources
Introduction
The UvA is fully committed to the principles of Open Science, emphasizing transparency and reproducibility in research. Open Science involves making scientific research freely accessible to everyone in society, ensuring research results are widely available. Open Science follows the FAIR (findable, accessible, interoperable, reusable) principles (FAIR checklist). To ensure Open Science, four general guiding rules hold for every project, whether it is a grant, a publication or a student project. It is crucial to:
1. Develop a data management plan at the beginning of your project
2. Maintain good data management practices during the project
3. Create a well-organized, self-contained archive at the project's conclusion
4. Share it with the scientific community in a data repository
To further support good data management practices, IBED provides a Data Management Bonus of 1000 euros to PhDs and Postdocs who, by the end of their project, deliver such a well-organized and self-contained archive. To support research at scale, this page describes the computational resources available to IBED. It provides both guidelines and resources (Storage, Archiving, HPC) and concludes with a list of courses and trainings offered by UvA, including good data management and computational practices. This document is tailored to IBED; for more generic information on resources available at faculty level, check the Data Analytics and Statistics Hub (DASH) and the UvA RDM website.
Data Management Plan
For almost all funding agencies a Data Management Plan (DMP) is a requirement. To prepare your DMP it is recommended to create an account at DMPonline. Most likely the template of your grant (e.g. NWO) will be available within the platform. Alternatively, you can use a template of the funding organization. Prepare your DMP and invite IBED's data manager (Johannes De Groeve) for review to check if the DMP is in line with RDM policy at UvA, FNWI and IBED. For examples of successful DMPs within IBED, please check here.
Typical questions and topics which are addressed in a DMP include:
- How will data be collected?
- How will data be stored and backed up?
- How will data be processed and analyzed?
- How will data be shared and disseminated?
- How will data be preserved?
- Which data formats and standards?
- Which metadata standards are used?
- Version control procedures?
- Is there a budget required for data management activities?
- Who is responsible for which data management tasks?
- Who will have access to the data, and which security procedures apply (e.g. for personal data)?
Answering these questions will streamline decisions about which computational resources you require during the project, including which type(s) of institutional storage are most applicable and what your High Performance Computing needs are. Moreover, it will help you to create a well-defined project directory and dataset structure. For instance, identifying a commonly used domain-specific metadata standard (e.g. the Darwin Core Standard for occurrence data) makes it easier to assess which essential variables are needed during data collection and in which format/unit these should be recorded. Hence, it can also help you define your sampling design. DMPs are often only created when required by funders. However, whether you are a student or a PI, it is always recommended to think about the above questions in advance. This will help with:
- Working in a structured manner
  - you avoid having your files and data scattered without naming conventions or file organisation
- Preventing time lost when
  - you have lost track of which dataset or code is the right one
  - you have forgotten the meaning of your columns
  - you don't remember the analysis workflow
- Facilitating how you share your research
  - with your PI
  - with external collaborators
  - upon publication
Project Setup
A common issue that most students and PIs encounter is how to keep their project organized. The starting point is to define a project directory structure for each individual project unit (e.g. a manuscript or a BSc/MSc thesis). The directory structure, and what can be defined as a project unit, is based on the knowledge you gained from your project proposal's roadmap and DMP. While project units can change over time, it is good to have an initial starting point for basic organisation. Here we provide some basic examples. To set up a basic R project, please check the following tutorial [TO BE CREATED].
Basic project directory structure
Example 1
.
├── README.md
├── code
│ ├── 0_data_preparation.Rmd
│ └── 1_analysis.Rmd
├── data
│ ├── input
│ └── output
├── docs
│ ├── manuscript.docx
│ ├── figs
│ └── tabs
└── project_name.Rproj
Example 2
.
├── README.md
├── code
│ ├── 0_data_preparation.Rmd
│ └── 1_analysis.Rmd
├── data
│ ├── raw
│ └── processed
├── docs
│ ├── manuscript.docx
│ ├── figs
│ └── tabs
└── project_name.Rproj
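The layouts above can also be created by a small script, so that every new project unit starts from the same skeleton. The sketch below is only an illustration based on Example 2: the function name `create_project` and the choice to create empty placeholder files are assumptions, and the `.Rproj` file and manuscript are deliberately left for RStudio and your word processor to create.

```python
from pathlib import Path

# Folder and file names follow "Example 2" above; adapt them to your project.
DIRECTORIES = ["code", "data/raw", "data/processed", "docs/figs", "docs/tabs"]
PLACEHOLDER_FILES = [
    "README.md",
    "code/0_data_preparation.Rmd",
    "code/1_analysis.Rmd",
]


def create_project(root):
    """Create the project skeleton under `root` and return the root path.

    Existing folders and files are left untouched, so the function can be
    re-run safely on a project that already exists.
    """
    root = Path(root)
    for directory in DIRECTORIES:
        (root / directory).mkdir(parents=True, exist_ok=True)
    for filename in PLACEHOLDER_FILES:
        (root / filename).touch(exist_ok=True)
    return root
```

Calling `create_project("my_thesis")` would create the folder `my_thesis` in the current working directory with the structure shown in Example 2.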
Data Storage
To be in line with institutional policy, researchers need to ensure that, prior to publication, research data is accessible through an institutional cloud storage solution. Depending on needs such as size, speed, resources and programmatic accessibility, different storage solutions are offered to IBED staff and students. Here we list all the storage resources that can be used for ongoing research.
N.B.: SURF Filesender is a tool for sending large files, but it is not a storage service.
OneDrive/SharePoint
- When to use: Personal file storage in Microsoft 365 cloud.
- Access: Through browser or Windows File Explorer on UvA computers.
- Features: Backup files, 1 TB limit.
- Cost: FREE
SURF Drive
- When to use: Personal file storage. Everyone has access to 1TB by default. Important: when students finish their studies, their account and files disappear, even when the files have been shared with their supervisor. The files are extremely difficult to recover, so please plan accordingly.
- Features: Store and share files/data with internal/external collaborators, (programmatically) accessible from computing resources.
- Access: https://surfdrive.surf.nl
- Cost: FREE
- Documentation:
Research Drive
- When to use: Project-specific research data, ideal for collaboration. Provided by UvA IT services.
- Features: Store and share files/data with internal/external collaborators, (programmatically) accessible from computing resources.
- Access: Requested by PIs or coordinators via the UvA self-service portal (serviceportal). https://uva.data.surfsara.nl.
- Cost: FREE
- Documentation:
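SURF Drive and Research Drive are both listed as programmatically accessible. A minimal sketch of what that could look like from a script, assuming the service exposes a WebDAV endpoint: the path in `DAV_ROOT`, the function name and the credentials are all assumptions for illustration and must be checked against the SURF documentation for your own account. The function only prepares the request; nothing is sent until it is passed to `urllib.request.urlopen`.

```python
import base64
import urllib.request
from urllib.parse import quote

# ASSUMPTION: WebDAV path of the storage service; verify it in the SURF docs.
DAV_ROOT = "/remote.php/webdav/"


def build_upload_request(base_url, remote_path, username, app_password,
                         dav_root=DAV_ROOT):
    """Prepare (but do not send) an HTTP PUT request for a WebDAV upload."""
    url = base_url.rstrip("/") + dav_root + quote(remote_path.lstrip("/"))
    token = base64.b64encode(f"{username}:{app_password}".encode()).decode()
    request = urllib.request.Request(url, method="PUT")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical user name, password and file path, for illustration only.
example = build_upload_request(
    "https://uva.data.surfsara.nl",
    "my_project/data/raw/observations.csv",
    "user@example.org",
    "app-password",
)
```

To actually upload, the file content would be attached as `data=` and the request sent with `urllib.request.urlopen(request)`. Keep passwords out of scripts that are shared or versioned.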
Tape SURF
- When to use: Long-term storage for infrequently accessed data (project-paid).
- Access: SURF service desk.
- Features: Long-term data storage in compressed units, 10 TB for 5 years at 150 euros per year.
- Cost: 150 euros per year (750 euros for 5 years)
- Documentation: More Info
Tape IBED
- When to use: Long-term storage for infrequently accessed data (non-project).
- Access: Request via Computational Support IBED (see IBED tape archive usage and policy)
- Features: Long-term data storage in compressed units.
- Cost: 5TB FREE
- Documentation: More Info
Faculty Storage (FEIOG)
- When to use: Faculty storage is a service aimed at cost-efficient storage of large datasets
- Access: Request via feiog@uva.nl
- Features: Fast access storage for large datasets
- Cost: This service is paid from the faculty's central budget and is offered under a fair-use policy. This means that you will not be billed for reasonable requests. You are nevertheless encouraged to include storage costs in your budget when you are externally funded. For this purpose, the price of the Faculty Storage service is €130 per TB per year.
- Documentation: More Info
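When budgeting storage in a DMP or grant proposal, the prices quoted above can be turned into a quick estimate. The sketch below uses only the two figures from this page (€130 per TB per year for Faculty Storage, €150 per year for 10 TB of SURF tape). That SURF tape is billed per started 10 TB unit is an assumption made for illustration, and current rates should always be confirmed with the service desks.

```python
import math

FACULTY_EUR_PER_TB_YEAR = 130   # Faculty Storage (FEIOG), from this page
SURF_TAPE_EUR_PER_UNIT_YEAR = 150  # SURF tape, from this page
SURF_TAPE_UNIT_TB = 10          # ASSUMPTION: billed per started 10 TB unit


def faculty_storage_cost(terabytes, years):
    """Budget estimate in euros for Faculty Storage."""
    return terabytes * years * FACULTY_EUR_PER_TB_YEAR


def surf_tape_cost(terabytes, years):
    """Budget estimate in euros for SURF tape, rounded up to whole units."""
    units = math.ceil(terabytes / SURF_TAPE_UNIT_TB)
    return units * years * SURF_TAPE_EUR_PER_UNIT_YEAR


# Example: 2 TB of active data during a 4-year PhD project,
# followed by 10 TB archived on tape for 5 years.
active_cost = faculty_storage_cost(2, 4)   # 2 * 4 * 130 = 1040 euros
archive_cost = surf_tape_cost(10, 5)       # 1 * 5 * 150 = 750 euros
```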
Data Publication
At the closure of a project (student project, publication), data and code must be published in a data repository following the Open Science and FAIR principles. Many general-purpose and domain-specific repositories exist (see below). Where possible, use domain-specific repositories. For instance, when working with sequencing data it is highly recommended to submit new sequences to NCBI following their metadata standards. When working with occurrence data (derived from e.g. field sampling, camera traps or DNA barcodes), it is encouraged to publish new records in GBIF using the Darwin Core Standard. However, not all types of datasets fit within a predefined standard; general-purpose repositories are therefore also important.
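Following a metadata standard is easier when the column names are checked before submission. The sketch below tests whether the header of a CSV table contains a handful of Darwin Core terms. The terms themselves exist in Darwin Core, but the selection in `REQUIRED_TERMS`, the function name and the sample record are assumptions for illustration; the repository's own requirements take precedence.

```python
import csv
import io

# ASSUMPTION: an illustrative minimal set of Darwin Core terms.
REQUIRED_TERMS = ["occurrenceID", "basisOfRecord", "scientificName", "eventDate"]


def missing_terms(csv_text, required=REQUIRED_TERMS):
    """Return the required terms that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    return [term for term in required if term not in header]


# Made-up example record, not real data.
SAMPLE = (
    "occurrenceID,scientificName,decimalLatitude,decimalLongitude\n"
    "obs-001,Parus major,52.355,4.955\n"
)

print(missing_terms(SAMPLE))
```

For the sample above, the check reports `basisOfRecord` and `eventDate` as missing, which is exactly the kind of gap that is cheap to fix during data collection and expensive afterwards.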
An important general-purpose repository is Figshare, for which UvA provides a UvA/AUAS institutional account to every staff member. Note that BSc and MSc students do not have access to the institutional Figshare. They can create accounts and upload files up to 5GB for open data and materials. However, we highly recommend that researchers invite students by creating a "project" via their institutional account. This way the research data is accessible to the PI in a public repository owned by UvA and there are no storage limitations.
Figshare
- When to use: Archive and publish data attached to a publication or PhD thesis.
- Access: Procedure.
- Features: Make data linked to a publication available online, generate a DOI, add metadata.
- Cost: FREE
- Documentation: Figshare Help, How to upload and publish data
(Domain-Specific) Repositories
- Zenodo: General purpose, 50GB limit, offers options for single or multiple file uploads.
- Datadryad: General purpose, 300GB limit, USD 120 base price.
- OSF: Open Science Framework. General purpose.
- GBIF: Biodiversity data.
- NCBI: For biomedical and genomic information.
- Pangaea: Earth science data, free but asks for a contribution.
- Movebank: Movement data.
- Paleobiodb: Paleo data.
- TRY-db: Plant traits.
- re3data: Search all available scientific repositories.
High Performance Computing
There are several options available to IBED staff and students who need extra computing capacity. We distinguish two Virtual Research Environments (VRE, Research Cloud) and three clusters (Crunchomics, Snellius, Spider / Grid). See the full list below.
Before applying for HPC resources, consider the following questions:
- Do you know how to use HPC systems?
- Do you need GPU or CPU (or both)?
- Do you know exactly what you need?
- How quickly do you need to get started and for how long?
- Anything specific regarding large datasets?
- Does it have to be completely free, or do you have project money?
The Computational Support Team can help you decide which option to use. The student or staff member should then contact the help desk of the chosen service to get access and troubleshoot specific problems.
Virtual Research Environment
- What is it: Cloud-based, lightweight work environment provided by UvA IT-services.
- Features:
  - Fully customizable in terms of computing power, storage capacity, and tools.
  - Suitable for maintaining a service or tool.
  - Runs on Linux or Windows on Microsoft Azure, and may later be available on SURF Cloud and Amazon AWS.
- Access: Requested by PIs or coordinators via the UvA self-service portal (serviceportal).
- Cost: Probably free for small projects; may need to pay for large projects.
- Documentation: more info
Research Cloud
- What is it: SURF collaborative environment and portal to different cloud providers.
- Features:
  - Everything is done through a workspace.
  - Can connect to data on Research Drive.
- Access: Contact SURF support desk or send an email to servicedesk@surf.nl.
- Cost: Pay with project budget if possible. E-infra grants available for students.
- Documentation: more info
Crunchomics
- What is it: FNWI (FEIOG) service for running heavy calculations.
- Features:
  - CPUs, GPUs, large memory nodes.
  - Group creation and collaboration.
- Access: Email Wim de Leeuw with your UvAnetID.
- Cost: Available and free for IBED and SILS.
- Documentation: more info
Snellius
- What is it: SURF service providing access to Snellius, the Dutch national supercomputer.
- Features: CPUs, GPUs, large symmetric multi-processing nodes.
- Access: Through SURF request portal.
- Cost: Researchers can apply for computing time, data services, and support. Large scale research projects need to apply via NWO.
- Documentation: more info and tutorial GPU access
Spider and the Grid
- What is it: Data processing platforms at SURF for highly parallel jobs on distributed resources.
- Features: Suitable for large, structured datasets.
- Access: Contact SURF support desk or send an email to servicedesk@surf.nl.
- Documentation: more info
Obtain Computational Facilities
When planning for computational resources within the Institute for Biodiversity and Ecosystem Dynamics (IBED), it's essential to understand the various funding levels and associated resources available through the institute, the Faculty of Science (FNWI), and the Dutch Research Council (NWO).
Please note that, except for Crunchomics, IBED's cluster for CPU-based computation, these resources are managed by SURF, and the amount of computational power or storage granted can vary over time. Examples of these facilities include the Snellius national supercomputer, primarily used for GPU computation; the SURF Research Cloud, a platform designed to facilitate the creation and management of virtual research environments; and the SURF Research Drive for storage. For more information, refer to the SURF User Knowledge Base.
Small Compute via FNWI Faculty and NWO
These resources provide access to SURF services and can typically be granted within a day or two. For example, GPU allocations are available approximately three times a year, each providing around 50,000 to 100,000 System Billing Units (SBUs). Researchers can apply for these resources through the SURF Service Desk by creating a ticket at https://servicedesk.surf.nl and selecting "Apply for access / Direct institute contract" or "Apply for access / Small Compute applications (NWO)."
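To judge whether an allocation of this size fits a project, it helps to convert SBUs into running time. The sketch below shows only the arithmetic: the rate of 100 SBUs per GPU-hour is a hypothetical placeholder, not a SURF rate, and the function name is illustrative. The actual SBU rates per node type are published by SURF and should be substituted before drawing any conclusion.

```python
# HYPOTHETICAL placeholder rate, NOT an official SURF figure.
# Replace it with the rate SURF publishes for the node type you will use.
EXAMPLE_SBU_PER_GPU_HOUR = 100


def gpu_hours(sbu_allocation, sbu_per_gpu_hour):
    """Number of GPU-hours an SBU allocation covers at a given rate."""
    if sbu_per_gpu_hour <= 0:
        raise ValueError("sbu_per_gpu_hour must be positive")
    return sbu_allocation / sbu_per_gpu_hour


# Range of the small allocations mentioned above, at the placeholder rate.
low = gpu_hours(50_000, EXAMPLE_SBU_PER_GPU_HOUR)
high = gpu_hours(100_000, EXAMPLE_SBU_PER_GPU_HOUR)
```

Dividing the result by the expected duration of one training or analysis run gives a rough number of runs the allocation can support.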
For Virtual Research Environment and Research Cloud access within FNWI faculty, please refer to their specific section.
NWO Large Compute
For extensive computational needs, researchers can apply for large compute grants via NWO. Detailed information and eligibility criteria are available on the NWO website. Please note that this procedure takes time.
NWO Research Capacity Computing Services (RCCS)
The Research Capacity Computing Service (RCCS) is a contractual framework offered by SURF, the collaborative organization for ICT in Dutch education and research. RCCS enables institutions to purchase flexible computing capacity using credits, granting their researchers direct access to SURF's national computing facilities, including the National Supercomputer Snellius and the SURF Research Cloud. To find the latest rates for services, search for "SURF Services and Rates" on Google. For more details, refer to Direct access to compute services, Snellius RCCS contract information, and SURF Research Cloud.
Code Versioning
Imagine you have a piece of code, and you're keen on tracking its changes without losing the original version. The conventional method involves saving scripts as new files, often labeled with indicators like 'v0' or a timestamp. Git offers a more seamless way to version your code without the hassle of managing different version files manually. It not only tracks changes made to your files but also equips you with tools to document those changes. While Git's initial development focused on code versioning, it's versatile enough to handle versioning of smaller datasets. GitHub and GitLab support various text file formats (e.g., csv, fasta), making them ideal for versioning.
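Git is normally used from the command line or from within RStudio, but the basic cycle (initialise, stage, commit, inspect the history) can be sketched in a few lines. The example below drives the standard `git` commands from Python and assumes Git is installed. The function names, the commit author and the commit messages are placeholders for illustration.

```python
import subprocess
from pathlib import Path


def run_git(args, cwd):
    """Run a git command in `cwd` and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout


def snapshot(project_dir, message):
    """Commit the current state of `project_dir` and return the history.

    Raises subprocess.CalledProcessError if there is nothing to commit.
    """
    project_dir = Path(project_dir)
    if not (project_dir / ".git").is_dir():
        run_git(["init"], project_dir)        # start tracking the folder
    run_git(["add", "--all"], project_dir)    # stage all changes
    run_git(
        [
            "-c", "user.name=Example User",         # placeholder author
            "-c", "user.email=user@example.org",
            "commit", "-m", message,
        ],
        project_dir,
    )
    return run_git(["log", "--oneline"], project_dir)
```

Each call to `snapshot` replaces one "analysis_v2_final.R" style copy: the file keeps its name, and the history returned by `git log` records what changed and why.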
Bioinformatics
Bioinformatics focuses on collecting, analyzing, and visualizing biological data, often using sequencing data. This is done via computational tools and methods that help make sense of vast amounts of data from genomic, proteomic, or transcriptomic studies. If you are working with bioinformatic data, check out the resources below to help you get started.
Getting started
- New to sequence data analysis and/or using a High-Performance Computer (HPC)? Start with this tutorial to learn the basics.
- Need guidance on documenting your bioinformatics work? Check out this resource for tips on effective documentation.
- Looking for software and workflows? Our bioinformatics guidance page provides tutorials, example bioinformatic workflows and a list of tools and databases to get you started.
Geoinformatics and GIS
Geoinformatics is about collecting, analyzing, and visualizing geographic information. A GIS (Geographic Information System) is a system that stores geospatial data, uses special tools to analyze spatial data, and creates maps and 3D models to represent spatial data visually. If your project has a spatial component, you will probably benefit from learning more about GIS!
Software
A list of some commonly used GIS software and tools:
- ESRI ArcGIS: This is one of the most user-friendly programs for making maps and analyzing GIS data, but it is not open source. It has robust and extensive analysis tools, along with excellent technical support, documentation, and training resources. It is the best tool for creating spatial data and complex, beautiful maps. Note: UvA holds the license for the ArcGIS software suite, so UvA staff and students have access to the software and ArcGIS Online for free for as long as they are at UvA. Visit the software page on the GIS-Studio website for instructions on how to download the software and get access to ArcGIS Online.
- QGIS (Quantum GIS): Open source, free GIS software that has a wide range of GIS functionalities and is highly customizable with Python scripting.
- Google Earth Engine: A cloud-based tool for analyzing and visualizing large-scale datasets. The platform also has a huge catalog of satellite imagery and other datasets, and extensive guidance and tutorials. Free for research and education (for now!).
- R/RStudio: Open source and free programming language and environment that has many packages designed for working with spatial data. The main advantage of using R is that your analyses will be far more reproducible and shareable.
IBED GIS-Studio
IBED has a GIS-Studio which helps staff and students with spatial data analysis using Geographical Information Systems (GIS) and remote sensing (RS) software. Visit the GIS-Studio website for more on how to reserve a computer and borrow hardware such as field tablets, drones, and laser scanners. There are also many examples of finished projects and ongoing research.
Geospatial Data
So much data, so little time! Here is a short list of our favorites that are useful for many applications:
- ArcGIS Living Atlas: You can bring datasets directly into ArcGIS maps.
- Top10NL: A large and detailed dataset with point, polygon, and line layers for almost everything visible in the Netherlands—buildings, water bodies, terrain.
- AHN (Actueel Hoogtebestand Nederland): A national elevation dataset providing height information for the entire country as both a digital terrain model (ground surface only) and a digital elevation model (including buildings, trees, etc.), often with a resolution of 0.5 meters or better, depending on the specific version.
- PDOK (Publieke Dienstverlening Op de Kaart): A central resource for geospatial information in the Netherlands managed by the Dutch government.
- European Environment Agency SDI Geospatial Data Catalogue
- NASA Earth Data Catalog
GIS & Spatial Data Training and More Info
- ArcGIS:
- R/RStudio:
- Python:
Artificial Intelligence (AI)
If you have data that contains patterns or regularities, it may be suitable for AI applications. AI techniques can uncover patterns and automate tasks across various research fields. If you're unsure whether your data or project is suitable for AI, please contact the Computational Support Team or me. You can also check out my tutorials here.
Examples of AI applications are abundant. ChatGPT, for instance, is a language model capable of generating human-like text. AlphaFold, the Nobel Prize-winning application developed by DeepMind, predicts protein 3D structures from amino acid sequences, significantly advancing bioinformatics. Self-driving cars utilize AI for object detection, navigation, and planning. In medicine, AI aids in detecting cancerous tissues.
In our department, I have worked on classifying bird behavior from GPS and angular acceleration time-series data, as well as detecting and tracking individual fish in videos.
Feel free to reach out to discuss how AI can enhance your research.
Courses at UvA
Name | URL |
---|---|
Data Science Centre | https://dsc.uva.nl |
UvA Library | https://uba.uva.nl |
eScience Center | https://www.esciencecenter.nl |
SURF | https://www.surf.nl |
CS Tutorials
Name | URL |
---|---|
Accessing GPUs | Website |
R-packaging | Website |
Git basics | Website |
Python basics | Website |
Useful Links
Service Desk
Name | URL |
---|---|
UvA Service Desk | serviceportal.uva.nl |
SURF Service Desk | servicedesk.surf.nl |
FNWI/FEIOG Service Desk | email to feiog@uva.nl |
Services Direct Links
Name | URL |
---|---|
SURF Filesender | filesender.surf.nl |
SURF Drive | surfdrive.surf.nl |
SURF Research Cloud | sram.surf.nl |
SURF Research Drive | researchdrive.surfsara.nl |
UvA Research Drive | uva.data.surfsara.nl |
Please submit an issue for improvements to the documentation, or contact the Computational Support Team of IBED.