Storage Options: Difference between revisions

From RCSWiki
No edit summary
 
(49 intermediate revisions by 5 users not shown)
Line 1: Line 1:
There are a few options researchers can take advantage of when storing their research data.  
There are a few options researchers can take advantage of when storing their research data. Please take into account the purpose of the storage, appropriate research data management principles, and the data classification when choosing an appropriate storage solution.  


== Data Classification ==
= Data Classification =
Please review the different data classifications that are outlined by the [https://ucalgary.ca/legal-services/sites/default/files/teams/1/Standards-Legal-Information-Security-Classification-Standard.pdf Information Security Classification Standard]. There are 4 levels of data classification which are summarized in the table below.
Please review the different data classifications that are outlined by the [https://ucalgary.ca/legal-services/sites/default/files/teams/1/Standards-Legal-Information-Security-Classification-Standard.pdf Information Security Classification Standard]. There are 4 levels of data classification which are summarised in the table below.


{| class="wikitable"
{| class="wikitable"
Line 44: Line 44:
: https://ucalgary.service-now.com/it?id=it_catalog_by_category&sys_id=4dbb82ee13661200c524fc04e144b044
: https://ucalgary.service-now.com/it?id=it_catalog_by_category&sys_id=4dbb82ee13661200c524fc04e144b044


== Research Data Management ==
= Research Data Management =
We recommend you follow good Research Data Management practices and ensure you have a DMP (Data Management Plan) created to guide your data's lifecycle. DMP Assistant has been created specifically for Canadian scholars and aims to meet any and all Tri-Agency requirements. See: https://assistant.portagenetwork.ca/
We recommend you follow good Research Data Management (RDM) practices and ensure you have a Data Management Plan (DMP) created to guide your data's life-cycle. Your DMP can help us support the FAIR (Findable, Accessible, Interoperable, and Reusable) principles for data management.


Your DMP can help us support the FAIR (findable, accessible, interoperable and reusable) principles for data management.
Please consider contacting Libraries and Cultural Resources for assistance.


Please consider contacting Libraries and Cultural Resources for assistance. For guidance on general data management and developing a DMP, consult https://library.ucalgary.ca/guides/researchdatamanagement or contact research.data@ucalgary.ca.
=== Resources ===


For support using PRISM Dataverse, UofC's institutional data repository, contact digitize@ucalgary.ca.
* DMP Assistant is a tool created specifically for Canadian scholars that aims to meet all Tri-Agency requirements. It is available at: https://assistant.portagenetwork.ca/
* For guidance on general data management and developing a DMP, consult https://library.ucalgary.ca/guides/researchdatamanagement or contact research.data@ucalgary.ca.
* For support using PRISM Dataverse, the University of Calgary's institutional data repository, contact digitize@ucalgary.ca.
* If you need to share and preserve your large post-publication data set for a mandated period of time, consider using the national Federated Research Data Repository (FRDR). Learn more at https://www.frdr-dfdr.ca/repo/. FRDR aligns with Tri-Agency principles as a platform for preservation, retention, and sharing of research data. See: [http://www.science.gc.ca/eic/site/063.nsf/eng/h_83F7624E.html Tri-Agency Statement of Principles on Digital Data Management]


If you need to share and preserve your large post-publication data set for a mandated period of time, please visit https://www.frdr-dfdr.ca/repo/ in order to learn more about the national Federated Research Data Repository.
= University of Calgary IT storage services =
You can learn more about Information Technologies Storage solutions at https://ucalgary.service-now.com/it?id=kb_article&sys_id=d785de4e1b3ed41422ba4158dc4bcbf1


FRDR aligns with Tri-Agency Principles as a platform for Preservation, Retention and Sharing of research data. see: [http://www.science.gc.ca/eic/site/063.nsf/eng/h_83F7624E.html Tri-Agency Statement of Principles on Digital Data Management]
== OneDrive for Business ==
OneDrive for Business is a storage solution provided by Microsoft and is available to all faculty and staff.
{| class="wikitable"
! Capacity
| 5 TB with quota increases [https://ucalgary.service-now.com/it?id=sc_cat_item&sys_id=438e6d8313896a0053f2d7b2e144b0b9 available on request].
|-
! Classification
| Level 1 - 4
|}
 
You may use OneDrive for Business to store your personal and work-related files. By default, files stored in OneDrive are private to you, but you can choose to share them and collaborate with others.
OneDrive for Business cannot be used as a department or project share space. There is no group/lab offering with OneDrive.
 
While OneDrive provides a secure, compliant location from an IT security standpoint, it is not the best location for data that the PI remains accountable for during the 5 years after completion of the study. This is not a security issue, but a data management issue.
 
For example, if a study used the personal OneDrive of one of the researchers to store all the records, and that researcher left the university, the OneDrive would be gone in 30 days.
 
Microsoft provides an automation capability for its O365 products. If you have a Windows machine, you can use the automation product ‘Flow’ (now called Power Automate) to copy a file to a local file system when a new file is created on OneDrive.
 
To back up data residing on ARC to your personal OneDrive allocation please see: [[How to transfer data#rclone: rsync for cloud storage]]
 
OneDrive requires Multi-Factor Authentication (MFA) enabled on your University of Calgary IT account.
 
More information can be located in the following article https://ucalgary.service-now.com/kb_view.do?sysparm_article=KB0032351
 
University of Calgary OneDrive data is reportedly hosted in Canada (Markham, Ontario).
 
=== Support for OneDrive ===
If you have questions, please contact the [https://ucalgary.service-now.com/it UService Support Centre].
 
=== Other Resources ===
For more information on OneDrive for Business:
* Operating Level of Agreement KB0032404 (https://ucalgary.service-now.com/it?id=kb_article&sys_id=7f57bddcdb56a3047cab5068dc9619b6)
*OneDrive for Business Getting Started KB0032351 (https://ucalgary.service-now.com/it?id=kb_article&sys_id=60994170db2da7487cab5068dc961900)
*If you are above 90% of your OneDrive quota, you can request an increase here: ( https://ucalgary.service-now.com/it?id=sc_cat_item&sys_id=438e6d8313896a0053f2d7b2e144b0b9) PLEASE NOTE: Microsoft will only increase an allocation while the Cloud Storage is more than 90% full. Please log into your O365 cloud account to review before making your request.
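
The 90% rule above can be checked before filing a request. The following is a minimal sketch only: the function name is hypothetical, the usage figure has to be read manually from your O365 account, and the 5 TB default is simply the capacity quoted in the table above.

```python
DEFAULT_QUOTA_TB = 5.0       # default OneDrive capacity quoted on this page
INCREASE_THRESHOLD = 0.90    # increases are only granted above 90% usage


def can_request_increase(used_tb, quota_tb=DEFAULT_QUOTA_TB):
    """Return True if usage is above the 90% threshold (hypothetical helper)."""
    if quota_tb <= 0:
        raise ValueError("quota_tb must be positive")
    return used_tb / quota_tb > INCREASE_THRESHOLD


print(can_request_increase(4.6))  # 92% of 5 TB
print(can_request_increase(4.0))  # 80% of 5 TB
```

With the default quota, 4.6 TB of usage qualifies for a request and 4.0 TB does not.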
 
Questions about whether data hosted on OneDrive is subject to US jurisdiction discovery or access should be directed to:
*https://cumming.ucalgary.ca/research-institutes/csm-research-services/legal-research-services (CSM researchers.)
*https://research.ucalgary.ca/contact/research-services (Not CSM Researchers)
*https://www.ucalgary.ca/legalservices/  (for teaching/learning – non research enquiries that make their way to you)
 
== Office365 SharePoint for research groups ==
To be determined....
 
Researchers will be able to request an Office 365 SharePoint site for a group at some point in the future
which could be considered a group cloud sharing platform.


* The official service page:
: https://ucalgary.service-now.com/it?id=sc_cat_item&sys_id=b55f2f72132f5240b5b4df82e144b085


== Secure Compute Data Storage (SCDS) ==
= University of Calgary RCS storage services =
 
=== Secure Compute Data Storage (SCDS) ===
Secure Computing Data Storage (SCDS) is a service provided by Research Computing Services that allows researchers to store restricted and confidential data. Collaboration with Level 4 data stored in SCDS is possible using ShareFile, a secure file sharing and collaboration tool by Citrix.
Secure Computing Data Storage (SCDS) is a service provided by Research Computing Services that allows researchers to store restricted and confidential data. Collaboration with Level 4 data stored in SCDS is possible using ShareFile, a secure file sharing and collaboration tool by Citrix.
   
   
Line 75: Line 128:
|}
|}


== AcademicFS ==
== ResearchFS ==
AcademicFS is a UofC hosted SMB/CIFS storage solution funded and operated by RCS. It is available by request to faculty and staff with active research data.
ResearchFS is a University of Calgary-hosted SMB/CIFS storage solution funded and operated by RCS. It is available by request to faculty and staff with active research data.
{| class="wikitable"
{| class="wikitable"
! Capacity
! Capacity
| 100GB with quota increases available on request.  
| 1TB with quota increases available on request.  
|-
|-
! Classification
! Classification
Line 87: Line 140:
| Visit [https://ucalgary.service-now.com/it?id=sc_cat_item&sys_id=fe66b3a7db297300897e4b8b0b96199d ServiceNow to request access]
| Visit [https://ucalgary.service-now.com/it?id=sc_cat_item&sys_id=fe66b3a7db297300897e4b8b0b96199d ServiceNow to request access]
|}
|}
=== Service Description ===
=== Service Description ===
You may use AcademicFS to store your active research data files. AcademicFS is intended to be used as a research group or project share. AcademicFS is available on campus or off campus using the IT supported VPN client. Information on how to download and install the VPN client can be found here: https://ucalgary.service-now.com/it?id=kb_article&sys_id=880e71071381ae006f3afbb2e144b05c (IT account login may be required).
You may use ResearchFS to store your active research data files. ResearchFS is intended to be used as a research group or project share. ResearchFS is available on campus or off campus using the IT supported VPN client. Information on how to download and install the VPN client can be found here: https://ucalgary.service-now.com/it?id=kb_article&sys_id=880e71071381ae006f3afbb2e144b05c (IT account login may be required).
All AcademicFS users must have a UofC IT account.
All ResearchFS users must have a University of Calgary IT account.


=== Data recovery ===
=== Data recovery ===
AcademicFS does daily snapshots at a bit past midnight, which it keeps for 30 days. You should be able to recover a deleted file for up to 30 days, if it was in your share overnight. If you create a file and delete it during a day, no snapshot will be available for you to recover. AcademicFS presents backups using the windows OS 'previous versions' functionality. If you are not familiar with using this, or if you are on a Linux or MacOS device, you can request a restore, with Service Now.
ResearchFS takes daily snapshots shortly after midnight and keeps them for 30 days. You should be able to recover a deleted file for up to 30 days, provided it was in your share overnight. If you create a file and delete it on the same day, no snapshot will be available to recover it from. ResearchFS presents snapshots through the Windows 'Previous Versions' feature. If you are not familiar with it, or if you are on a Linux or macOS device, you can request a restore through ServiceNow.
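
The snapshot rule can be illustrated with a short Python sketch. It is not part of the ResearchFS service: the function names are hypothetical, and the exact snapshot time is an assumption (00:05 stands in for "shortly after midnight").

```python
from datetime import datetime, time, timedelta

SNAPSHOT_TIME = time(0, 5)       # assumption: snapshot runs at 00:05
RETENTION = timedelta(days=30)   # snapshots are kept for 30 days


def first_snapshot_at_or_after(moment):
    """Return the first daily snapshot time at or after the given moment."""
    candidate = datetime.combine(moment.date(), SNAPSHOT_TIME)
    if candidate < moment:
        candidate += timedelta(days=1)
    return candidate


def is_recoverable(created, deleted, now):
    """True if a snapshot still retained at `now` was taken while the file existed."""
    snap = first_snapshot_at_or_after(created)
    while snap < deleted and snap <= now:
        if now - snap <= RETENTION:
            return True
        snap += timedelta(days=1)
    return False


now = datetime(2024, 9, 11, 9, 0)
# Created and deleted on the same day: no snapshot ever contained the file.
print(is_recoverable(datetime(2024, 9, 10, 9), datetime(2024, 9, 10, 17), now))
# Present in the share overnight: the next morning's snapshot contains it.
print(is_recoverable(datetime(2024, 9, 9, 9), datetime(2024, 9, 10, 17), now))
```

The first call prints False and the second prints True, matching the two cases described above.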


For backup, we replicate changes to a distant data center every hour. The storage hardware which hosts your data is located in the basement of the Math Sciences building and our backup is in the HRIC building, so in case of an on campus disaster, your data should be safe.
For backup, we replicate changes to a distant data center every hour. The storage hardware which hosts your data is located in the basement of the Math Sciences building and our backup is in the HRIC building, so in case of an on campus disaster, your data should be safe.
 
=== Support for AcademicFS ===
=== Support for ResearchFS ===
If you have questions, please contact the IT Support Centre.
If you have questions, please contact the IT Support Centre.
: Mon – Fri: 8:30 am – 5:00 pm; Sat, Sun & holidays: 10:00 am – 2:00 pm.
: Mon – Fri: 8:30 am – 5:00 pm; Sat, Sun & holidays: 10:00 am – 2:00 pm.
: Live Chat: ucalgary.ca/it
: Live Chat: ucalgary.ca/it
: Email: itsupport@ucalgary.ca
: Email: itsupport@ucalgary.ca
: Phone: 403.220.5555
: Phone: 403.210.9300
: In person: 773 Math Science
: In person: 773 Math Science


== ARC Cluster Storage ==
ARC storage is used to '''support workflows on the ARC computing cluster'''. The expectation is that storage on ARC will only be used for active and upcoming computational projects.
It is not suitable for long-term or archival storage as it is not backed-up and is not guaranteed to be available for the time periods that are typical of archiving.
ARC is a research cluster, which means it has high performance but can be stopped for required maintenance when needed.
Thus, ARC cannot be relied on for any kind of service that requires constant availability.
This means, in turn, that ARC's storage cannot and '''should not be used as a main storage facility for research data'''.
The '''master copy''' of research data should be '''stored elsewhere''', and only the part of that data needed for computational analysis should be copied to ARC.


=== ARC Home Directories ===
Every user account on ARC has a static 500GB allocation of storage and a maximum of 1.5 million files (including directories). This cannot be increased or decreased. Home directory storage is connected via a network file system to the rest of the cluster and supports fast data transfer to memory on compute nodes. This also means that basic file system commands (like <code>ls</code>, <code>find</code>, and <code>du</code>) take longer to run as the number of files in your home directory increases. In particular, we strongly encourage users to stay under 100,000 files if at all possible. This can be achieved by combining smaller data files into single larger files, using structured data formats rather than large numbers of text files, or combining collections of files that will be used together into archives (tar, dar, etc.). Since top-level permissions on home directories are set to prevent other users from reading or executing, home directories are not suitable for sharing data directly with colleagues working on ARC. A Research Group Allocation is a more appropriate place for storing shared data or very large data sets that will be used as part of active computational projects.
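
As an illustration of the file-count advice, the following Python sketch counts the entries under a directory and bundles a directory into a single tar archive. It uses only the standard library. The function names are hypothetical and this is not an ARC-provided tool.

```python
import os
import tarfile

RECOMMENDED_MAX_FILES = 100_000   # recommendation quoted above
HARD_MAX_FILES = 1_500_000        # hard limit quoted above (includes directories)


def count_entries(root):
    """Count files and directories below root; both count toward the limit."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total


def pack_directory(src_dir, archive_path):
    """Bundle src_dir into one gzip-compressed tar archive and return its path."""
    top_name = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=top_name)
    return archive_path
```

For example, <code>count_entries(os.path.expanduser("~"))</code> reports how close a home directory is to the recommended limit, and <code>pack_directory("results_2023", "results_2023.tar.gz")</code> replaces many small files with one archive once the original directory has been verified and removed (the directory name here is only a placeholder).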


== OneDrive for Business ==
=== ARC /work and /bulk Group Allocation ===
OneDrive for Business is a storage solution provided by Microsoft and is available by request to all faculty and staff.
Any group member who wants to use the shared storage should email support@hpc.ucalgary.ca to be added to the access group, and CC the PI or data owner. '''This will confirm that the PI approves the group member's request for access to the shared storage.''' Please note that access permissions inside the directory are expected to be managed by the data owners.
{| class="wikitable"
! Capacity
| 5 TB
|-
! Classification
| Level 1 - 4
|-
! Request Access
| Visit [https://ucalgary.service-now.com/it?id=sc_cat_item&sys_id=522b68ebdb83e700897e4b8b0b961997 ServiceNow to request access]
|}


You may use OneDrive for Business to store your personal and work related files.
Files stored within OneDrive are by default private only to you but has the option to allow sharing and collaboration with others.
OneDrive for Business cannot be used as a department or project share space.
There is no group/lab offering with OneDrive.


All requests should answer the following questions:


* How much storage is requested and why is that the amount that you need?
A rationale for a request can be a formal data management plan or something more informal like a rough estimate to the primary dataset used for a project and a rough estimate to the size of outputs expected from your computations that are planned to run on ARC over the '''next year.'''
* What is the requested '''allocation name'''? (typically something like <PI name>_lab, <code>smith_lab</code>, for example)
* What is the '''data classification''' using the University of Calgary data security classification system?
* Which user or users would be the '''owner''' of the allocation? (Full Name and UCalgary Email address, typically the requesting PI but there may be co-PIs)
* Which members of the allocation should be able to '''request access''' for new users? (Full Name and UCalgary Email address for active ARC users)
* What is the '''faculty''' of the owner or owners?
* Please provide a short description of the lab.
* Please provide a brief '''numerical estimate''' of the required storage space based on '''projects''' that will use the allocation and their storage '''requirements'''.


While OneDrive provides a secure/compliant location from an IT Security stand point,
it’s not the most adequate location for data the PI is accountable for 5 years upon completion of the study.
This is not a security issue, but a data management issue.


For example, if a study was using a personal OneDrive of one of the researchers to store all the records,
'''Example 1''': "We will be processing a 1TB dataset by performing 100 experimental runs.
and the researcher was to leave the university, this OneDrive would be gone in 30 days.
Each experiment will be processed to produce a 6GB output, giving 600GB of the total output data.
We will also need 400GB additional space for post-processing and data management.
Thus, we would like to request '''2TB''' of shared space in total."




'''Example 2''': "3 members of our research group need additional shared space on ARC for their independent projects.
Project 1 starts with 100GB of initial data and is expected to generate 800GB of the output results.
Project 2 is going to use simulations and does not use any input data but is expected to generate 2TB of the simulated data for further processing.
The processing will require 200GB of additional space.
Project 3 will be working on a 1TB dataset and is expected to generate about 1TB of the output data.
These projects, therefore, will require 5.1TB of storage.
For convenience of data manipulation and management we would also like to have additional 400GB of extra storage space.
Therefore, we would like to request '''5.5TB''' of shared storage space in total." 
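
The arithmetic behind both example requests can be reproduced with a small Python sketch. The class and function names are hypothetical, and decimal units (1TB = 1000GB) are assumed because that is how the examples add up.

```python
from dataclasses import dataclass

GB_PER_TB = 1000  # decimal units, matching the arithmetic in the examples


@dataclass
class Project:
    """Storage needs of one project, in GB."""
    name: str
    input_gb: float = 0.0
    output_gb: float = 0.0
    scratch_gb: float = 0.0  # post-processing or working space


def total_request_tb(projects, extra_gb=0.0):
    """Sum all per-project needs plus shared headroom and return the total in TB."""
    total_gb = sum(p.input_gb + p.output_gb + p.scratch_gb for p in projects)
    return (total_gb + extra_gb) / GB_PER_TB


# Example 1: 1TB input, 100 runs of 6GB output each, 400GB post-processing space.
example_1 = [Project("runs", input_gb=1000, output_gb=100 * 6, scratch_gb=400)]

# Example 2: three independent projects plus 400GB of shared headroom.
example_2 = [
    Project("project 1", input_gb=100, output_gb=800),
    Project("project 2", output_gb=2000, scratch_gb=200),
    Project("project 3", input_gb=1000, output_gb=1000),
]

print(total_request_tb(example_1))                # 2.0
print(total_request_tb(example_2, extra_gb=400))  # 5.5
```

Listing each project separately in this way makes it easy to show in the request how the final number was reached.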


MS has an automation capability for their O365 products.
Work and Bulk storage can be considerably larger than the home directory allocations. However, there are limits on what RCS can provide as ARC storage provides high-speed access and is expensive to purchase. Typically, '''any request over 10TB''' will require some discussion. Work and Bulk allocations differ in a few ways that influence how they are used. Work storage is faster to access as part of computational jobs on ARC although the impact is small for jobs that don't involve enormous numbers of reads. Bulk storage is designed to be a target for instrument data (which is typically processed in a way that reads data a small number of times per job) and is capable of mounting instruments elsewhere on campus using SMB. A number of questions come up frequently about Work and Bulk storage and these are addressed in an [[Group Storage Allocation FAQ | FAQ]].
If you have a windows OS machine, you can use the automation product ‘Flow’ to copy a file to a local file system when a new file is created on OneDrive.


To back up data residing on ARC to your personal OneDrive allocation please see:  [[How to transfer data#rclone: rsync for cloud storage]]
= Digital Research Alliance of Canada storage services =


OneDrive requires Multi-Factor Authentication (MFA) enabled on your University of Calgary IT account.  
== Storage on the Alliance HPC clusters ==
* Alliance Wiki article "Storage and file management":
: https://docs.alliancecan.ca/wiki/Storage_and_file_management


== The Alliance NextCloud ==
For personal or level 1 data, you may use an external solution from the Alliance.
You need an Alliance account to use the service.
Its functionality is similar to Dropbox or Google Drive.
:https://nextcloud.computecanada.ca
: 100 GB of storage that can be shared between your computers.
: Alliance documentation: https://docs.alliancecan.ca/wiki/Nextcloud


UofC OneDrive data is reportedly hosted in Canada (Markham Ont).
= National Data Management Infrastructure =


===Request Access===
* The Alliance RDM information:
To request for OneDrive for Business:
: https://alliancecan.ca/en/services/research-data-management
#Submit your request on ServiceNow using the OneDrive for Business request form (https://ucalgary.service-now.com/it?id=sc_cat_item&sys_id=522b68ebdb83e700897e4b8b0b961997)
#The IT Support Centre will contact you
#Set a time with IT Support Centre to turn on MFA
# Turn on MFA
#Turn on OneDrive for Business


===Support for OneDrive for Business===
* Alliance notes on Research Data Management:
If you have questions, please contact the IT Support Centre.
: https://docs.alliancecan.ca/wiki/Research_Data_Management
: Mon – Fri: 8:30 am – 5:00 pm; Sat, Sun & holidays: 10:00 am – 2:00 pm.
:Live Chat: ucalgary.ca/it
:Email: itsupport@ucalgary.ca
: Phone: 403.220.5555
:In person: 773 Math Science


===Data recovery===
== Borealis Dataverse Repository ==


===Other Resources===
* https://borealisdata.ca/
For more information on OneDrive for Business:
* Operating Level of Agreement KB0032404 (https://ucalgary.service-now.com/it?id=kb_article&sys_id=7f57bddcdb56a3047cab5068dc9619b6)
*OneDrive for Business Getting Started KB0032351 (https://ucalgary.service-now.com/it?id=kb_article&sys_id=60994170db2da7487cab5068dc961900)
*If you are above 90% of your OneDrive quota, you can request an increase here: ( https://ucalgary.service-now.com/it?id=sc_cat_item&sys_id=438e6d8313896a0053f2d7b2e144b0b9) PLEASE NOTE: Microsoft will only increase an allocation while the Cloud Storage is more than 90% full. Please log into your O365 cloud account to review before making your request.


Any questions regarding if data hosted on OneDrive is subject to US jurisdiction discovery or access should be directed to:
Borealis, the Canadian Dataverse Repository, is a bilingual, multidisciplinary, secure, Canadian research data repository, supported by academic libraries and research institutions across Canada. Borealis supports open discovery, management, sharing, and preservation of Canadian research data.
*https://cumming.ucalgary.ca/research-institutes/csm-research-services/legal-research-services (CSM researchers.)
*https://research.ucalgary.ca/contact/research-services (Not CSM Researchers)
*https://www.ucalgary.ca/legalservices/  (for teaching/learning – non research enquiries that make their way to you)
 
==Office365 SharePoint for research groups==
 
To be determined....
 
Researchers will be able to request an Office 365 SharePoint site for a group at some point in the future
which could be considered a group cloud sharing platform.
 
* The official service page:
: https://ucalgary.service-now.com/it?id=sc_cat_item&sys_id=b55f2f72132f5240b5b4df82e144b085
 
==Personal storage options==
For personal or level 1 data, you may use an external solution from the Alliance.
One has to have an Alliance account to use the service.
This is similar to DropBox or Google drive functionality.  


*'''The Alliance NextCloud''':
== Federated Research Data Repository (FRDR) ==
:https://nextcloud.computecanada.ca
The Federated Research Data Repository (FRDR) is a suitable solution for long-term archival storage of research datasets used in published research work. FRDR is a bilingual publishing platform for sharing and preserving Canadian research data.
: 100 GB of storage that can be shared between your computers.
It is a curated, general-purpose repository, custom built for large datasets.
: Alliance documentation: https://docs.alliancecan.ca/wiki/Nextcloud
FRDR is run by the Digital Research Alliance of Canada.


== Home Directories and Research Group Allocations on ARC ==
For more information on FRDR visit their web site:  https://www.frdr-dfdr.ca/repo/


ARC storage is used to support workflows on the ARC computing cluster. The expectation is that storage on ARC will only be used for active and upcoming computational projects. It is not suitable for long-term or archival storage as it is not backed-up and is not guaranteed to be available for the time periods that are typical of archiving.
= Commercial Cloud Based Storage Options =
ARC is a research cluster, which means it has high performance but can be stopped for required maintenance when needed.
Thus, ARC cannot be relied on for any kind of service that requires constant availability.
Which means, in turn that ARC's storage cannot and should not be used as a main storage facility for research data.
The master copy of research data should be stored elsewhere and only part of that data are expected to be copied to ARC for computational analysis.


=== Home Directories ===
== Amazon Web Services ==
Every user account on ARC has a static 500GB allocation of storage and a maximum of 1.5 million files (including directories). This cannot be increased or decreased. Home directory storage is connected via a network file system to the rest of the cluster and supports fast data transfer to memory on compute nodes. This also means that basic file system commands (like <code>ls</code>, <code>find</code>, and <code>du</code>) take longer to run as the number of files in your home directory increases. In particular, we strongly encourage users to stay under 100000 files if it is at all possible. This can be achieved by combining smaller data files into single larger files, using structured data formats rather than large number of text files, or combining collections of files that will be used together into archives (tar, dar, etc). Since top level permissions on home directories are set to prevent other users from reading or executing, home directories are not suitable for sharing data directly with colleagues working on ARC. A Research Group Allocation is a more appropriate place for storing shared data or very large data sets that will be used as part of active computational projects.  
Provided by '''Amazon Web Services, Inc.'''


=== Research Group Allocations (<code>/work</code> and <code>/bulk</code>) ===
* Pricing calculator: https://calculator.aws
The principal investigator (PI) for a research group may request an extended shared allocation for the research group by contacting support@hpc.ucalgary.ca with answers to the following questions (please copy the full text of the questions into your email and write answers under it):


* How much storage is requested and why is that the amount that you need?
A rationale for a request can be a formal data management plan or something more informal like a rough estimate to the primary dataset used for a project and a rough estimate to the size of outputs expected from your computations that are planned to run on ARC over the '''next year.'''


* What is the requested allocation name? (typically something like <PI name>_lab)
AWS provides many different kinds of services, including '''storage services'''.
* What is the data classification using the University of Calgary data security classification system?
* Which user or users would be the owner of the allocation? (Full Name and UCalgary Email address, typically the requesting PI but there may be co-PIs)
* Which members of the allocation should be able to request access for new users? (Full Name and UCalgary Email address for active ARC users)
* What is the faculty of the owner or owners?
* Please provide a short description of the lab or project that will use the allocation.


These options can be a solution for your research needs, but
* it '''can be expensive''', depending on your needs and the amount of data;
* the '''pricing scheme is complex''' and can be confusing for new users;
* the '''number of options can be overwhelming'''. Many of them are designed to provide pricing flexibility, not to increase functionality.


'''Example 1''': "We will be processing a 3T dataset consisting of 1000 experimental runs. Each experiment will be processed to produce a 6GB output and we will need some further space for post-processing. We would like to request 12TB total." 


'''Example 2''': "Our research group has 5 members with separate projects. 3 have projects that will use 1TB of data and 2 have projects that will require 3TB of data. We would like to request 10TB total."
The key points:
* '''Uploading data''' to AWS is '''free'''.
* '''Storing data''' on AWS storage is a '''paid service'''.
* '''Downloading data''' from AWS storage to your computer is a '''paid service'''.
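
The three key points translate into a simple cost model, sketched below in Python. The function name is hypothetical and the rates are deliberately left as parameters: they are not AWS prices, and real pricing varies by service, region, and usage tier, so take actual figures from the pricing calculator linked above.

```python
def estimate_cost(stored_gb, months, downloaded_gb,
                  storage_rate_per_gb_month, download_rate_per_gb):
    """Rough estimate: uploads are free, storage and downloads are billed."""
    upload_cost = 0.0  # uploading data is free
    storage_cost = stored_gb * months * storage_rate_per_gb_month
    download_cost = downloaded_gb * download_rate_per_gb
    return upload_cost + storage_cost + download_cost


# Placeholder rates for illustration only; these are NOT real AWS prices.
cost = estimate_cost(stored_gb=1000, months=12, downloaded_gb=500,
                     storage_rate_per_gb_month=0.01, download_rate_per_gb=0.05)
print(round(cost, 2))
```

Even in this simplified model, the cost of retrieving data can be a noticeable share of the total, which is worth considering before moving a master copy of research data to a commercial cloud.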


 
[[Category:Administration]]
Work and Bulk storage can be considerably larger than the home directory allocations. However, there are limits on what RCS can provide as ARC storage provides high-speed access and is expensive to purchase. Typically, '''any request over 10TB''' will require some discussion. Work and Bulk allocations differ in a few ways that influence how they are used. Work storage is faster to access as part of computational jobs on ARC although the impact is small for jobs that don't involve enormous numbers of reads. Bulk storage is designed to be a target for instrument data (which is typically processed in a way that reads data a small number of times per job) and is capable of mounting instruments elsewhere on campus using SMB. A number of questions come up frequently about Work and Bulk storage and these are addressed in an [[Group Storage Allocation FAQ | FAQ]].
{{Navbox Administration}}
[[Category:Guides]]

Latest revision as of 18:58, 11 September 2024

There are a few options researchers can take advantage of when storing their research data. Please take into account the purpose of the storage, appropriate research data management principles, and the data classification when choosing an appropriate storage solution.

Data Classification

Please review the different data classifications that are outlined by the Information Security Classification Standard. There are 4 levels of data classification which are summarised in the table below.

{| class="wikitable"
! Level
! Description
! Example
|-
| Level 1
| Public
|
* Reference data sets
* Published research data
|-
| Level 2
| Internal
|
* Internal memos
* Unpublished research data
* Anonymized or de-identified human subject data
* Library transactions and journals
|-
| Level 3
| Confidential
|
* Faculty/staff employment applications, personnel files, contact information
* Donor or prospective donor information
* Contracts
* Intellectual property
|-
| Level 4
| Restricted
|
* Patient identifiable health information
* Identifiable human subject research data
* Information subject to special government requirements
|}

When selecting a storage option, you must use one that meets or exceeds the rated security classification.

  • See also the Collaboration, storage and file shares article in Service Now:
https://ucalgary.service-now.com/it?id=it_catalog_by_category&sys_id=4dbb82ee13661200c524fc04e144b044
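
The "meets or exceeds" rule can be expressed as a simple lookup, sketched below in Python. The function names are hypothetical, and the table only contains the two options whose rating is stated explicitly on this page. Ratings for other services should be taken from their own sections before adding them.

```python
# Highest data classification level each option accepts.
# Only values stated on this page are listed here.
MAX_LEVEL = {
    "OneDrive for Business": 4,  # "Level 1 - 4"
    "Alliance NextCloud": 1,     # "personal or level 1 data"
}


def is_suitable(option, data_level):
    """True if the option is rated at or above the data's classification level."""
    if data_level not in (1, 2, 3, 4):
        raise ValueError("data_level must be 1, 2, 3, or 4")
    return MAX_LEVEL[option] >= data_level


def suitable_options(data_level):
    """Return the listed options that may hold data of the given level."""
    return sorted(name for name in MAX_LEVEL if is_suitable(name, data_level))


print(suitable_options(1))
print(suitable_options(3))
```

A rating that covers the classification level is only a minimum requirement. The purpose of the storage and data management considerations still apply when choosing among the suitable options.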

Research Data Management

We recommend you follow good Research Data Management (RDM) practices and ensure you have a Data Management Plan (DMP) created to guide your data's life-cycle. Your DMP can help us support the FAIR (Findable, Accessible, Interoperable, and Reusable) principles for data management.

Please consider contacting Libraries and Cultural Resources for assistance.

Resources

University of Calgary IT storage services

You can learn more about Information Technologies Storage solutions at https://ucalgary.service-now.com/it?id=kb_article&sys_id=d785de4e1b3ed41422ba4158dc4bcbf1

OneDrive for Business

OneDrive for Business is a storage solution provided by Microsoft and is available to all faculty and staff.

Capacity 5 TB with quota increases available on request.
Classification Level 1 - 4

You may use OneDrive for Business to store your personal and work-related files. Files stored in OneDrive are private to you by default, but you can choose to share them and collaborate with others. OneDrive for Business cannot be used as a department or project share space, and there is no group or lab offering.

While OneDrive provides a secure and compliant location from an IT Security standpoint, it is not the best location for data that the PI remains accountable for during the 5 years following completion of the study. This is a data management issue, not a security issue.

For example, if a study stored all of its records in the personal OneDrive of one of the researchers and that researcher were to leave the university, the OneDrive would be gone in 30 days.

Microsoft provides an automation capability for its Office 365 products. If you have a Windows machine, you can use the automation product ‘Flow’ to copy a file to a local file system whenever a new file is created on OneDrive.

To back up data residing on ARC to your personal OneDrive allocation, please see the "rclone: rsync for cloud storage" section of How to transfer data.
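
As a rough illustration of what such a backup might look like, the Python sketch below only assembles an `rclone copy` command line; it does not run it. It assumes that rclone is installed and that a remote has already been set up with `rclone config`. The remote name `onedrive` and both paths are hypothetical placeholders. The linked article remains the authoritative guide.

```python
import shlex


def build_rclone_copy(source_dir, remote, dest_path, dry_run=True):
    """Return the argument list for an `rclone copy` invocation.

    `remote` is the name chosen during `rclone config` (hypothetical here).
    With dry_run=True, rclone only reports what it would transfer.
    """
    cmd = ["rclone", "copy", source_dir, f"{remote}:{dest_path}"]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


# Hypothetical source directory on ARC and destination folder on OneDrive.
cmd = build_rclone_copy(
    "/home/username/project_results", "onedrive", "arc_backup/project_results"
)
print(shlex.join(cmd))
```

Once the dry run reports what you expect, the same list built with `dry_run=False` could be passed to `subprocess.run(cmd, check=True)`.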

OneDrive requires Multi-Factor Authentication (MFA) to be enabled on your University of Calgary IT account.

More information can be found in the following article: https://ucalgary.service-now.com/kb_view.do?sysparm_article=KB0032351

University of Calgary OneDrive data is reportedly hosted in Canada (Markham, Ontario).

Support for OneDrive

If you have questions, please contact the UService Support Centre.

Other Resources

For more information on OneDrive for Business:

Any questions regarding if data hosted on OneDrive is subject to US jurisdiction discovery or access should be directed to:

Office365 SharePoint for research groups

To be determined....

At some point in the future, researchers will be able to request an Office 365 SharePoint site for a group, which could serve as a group cloud sharing platform.

  • The official service page:
https://ucalgary.service-now.com/it?id=sc_cat_item&sys_id=b55f2f72132f5240b5b4df82e144b085

University of Calgary RCS storage services

Secure Compute Data Storage (SCDS)

Secure Computing Data Storage (SCDS) is a service provided by Research Computing Services that allows researchers to store restricted and confidential data. Collaboration with Level 4 data stored in SCDS is possible using ShareFile, a secure file sharing and collaboration tool by Citrix.

Capacity 10 GB or more
Classification Level 4
Learn More Visit The SCDS Website
Request Access Visit ServiceNow to request access

ResearchFS

ResearchFS is a University of Calgary-hosted SMB/CIFS storage solution funded and operated by RCS. It is available by request to faculty and staff with active research data.

Capacity 1TB with quota increases available on request.
Classification Level 1 - 2
Request Access Visit ServiceNow to request access

Service Description

You may use ResearchFS to store your active research data files. ResearchFS is intended to be used as a research group or project share. ResearchFS is available on campus or off campus using the IT supported VPN client. Information on how to download and install the VPN client can be found here: https://ucalgary.service-now.com/it?id=kb_article&sys_id=880e71071381ae006f3afbb2e144b05c (IT account login may be required). All ResearchFS users must have a University of Calgary IT account.

Data recovery

ResearchFS takes a daily snapshot shortly after midnight and keeps each snapshot for 30 days. You should be able to recover a deleted file for up to 30 days, provided it was in your share overnight. If you create a file and delete it on the same day, no snapshot will contain it and it cannot be recovered. ResearchFS presents its snapshots through the Windows 'Previous Versions' functionality. If you are not familiar with this feature, or if you are on a Linux or macOS device, you can request a restore through ServiceNow.
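
The snapshot rule can be made concrete with a small Python sketch. It assumes the snapshot runs at 00:05, a stand-in for "shortly after midnight" because the exact time is not stated here. The function name is made up for this illustration. A file counts as recoverable when a snapshot was taken while it existed and that snapshot is still within the 30-day retention window.

```python
from datetime import datetime, time, timedelta


def is_recoverable(created, deleted, now,
                   snapshot_time=time(0, 5), retention_days=30):
    """Return True if a retained snapshot still contains the deleted file.

    snapshot_time is an assumed value; the real schedule may differ.
    """
    # Latest snapshot taken strictly before the file was deleted.
    snapshot = datetime.combine(deleted.date(), snapshot_time)
    if snapshot >= deleted:
        snapshot -= timedelta(days=1)
    captured = created <= snapshot
    retained = now - snapshot <= timedelta(days=retention_days)
    return captured and retained


# Created and deleted on the same day: no snapshot ever saw the file.
same_day = is_recoverable(
    datetime(2024, 9, 10, 9, 0), datetime(2024, 9, 10, 17, 0),
    now=datetime(2024, 9, 11, 9, 0))

# File was in the share overnight and was deleted 10 days ago.
overnight = is_recoverable(
    datetime(2024, 9, 9, 9, 0), datetime(2024, 9, 10, 17, 0),
    now=datetime(2024, 9, 20, 9, 0))

# Same file, but the last snapshot containing it is older than 30 days.
expired = is_recoverable(
    datetime(2024, 9, 9, 9, 0), datetime(2024, 9, 10, 17, 0),
    now=datetime(2024, 10, 20, 9, 0))

print(same_day, overnight, expired)
```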

For backup, we replicate changes to a distant data center every hour. The storage hardware which hosts your data is located in the basement of the Math Sciences building and our backup is in the HRIC building, so in case of an on campus disaster, your data should be safe.

Support for ResearchFS

If you have questions, please contact the IT Support Centre.

Mon – Fri: 8:30 am – 5:00 pm; Sat, Sun & holidays: 10:00 am – 2:00 pm.
Live Chat: ucalgary.ca/it
Email: itsupport@ucalgary.ca
Phone: 403.210.9300
In person: 773 Math Science

ARC Cluster Storage

ARC storage is used to support workflows on the ARC computing cluster. The expectation is that storage on ARC will only be used for active and upcoming computational projects. It is not suitable for long-term or archival storage, as it is not backed up and is not guaranteed to be available for the time periods that are typical of archiving. ARC is a research cluster, which means it offers high performance but can be stopped for required maintenance when needed. ARC therefore cannot be relied on for any kind of service that requires constant availability, which in turn means that ARC's storage cannot and should not be used as the main storage facility for research data. The master copy of research data should be stored elsewhere, and only the part of that data needed for computational analysis should be copied to ARC.

ARC Home Directories

Every user account on ARC has a fixed allocation of 500GB of storage and a maximum of 1.5 million files (including directories). This cannot be increased or decreased. Home directory storage is connected via a network file system to the rest of the cluster and supports fast data transfer to memory on compute nodes. This also means that basic file system commands (like ls, find, and du) take longer to run as the number of files in your home directory increases. In particular, we strongly encourage users to stay under 100,000 files if at all possible. This can be achieved by combining smaller data files into single larger files, using structured data formats rather than large numbers of text files, or combining collections of files that will be used together into archives (tar, dar, etc.). Since top-level permissions on home directories are set to prevent other users from reading or executing, home directories are not suitable for sharing data directly with colleagues working on ARC. A Research Group Allocation is a more appropriate place for storing shared data or very large data sets that will be used as part of active computational projects.
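
One way to keep the file count down is sketched below in Python, using only the standard library. `count_entries` counts files and directories below a path, the same way the limit described above counts them. `archive_directory` bundles a directory into a single compressed tar file. Both function names are made up for this illustration. The demonstration runs in a temporary directory so that it does not touch real data.

```python
import os
import tarfile
import tempfile
from pathlib import Path


def count_entries(root):
    """Count files and directories below root (root itself is not counted)."""
    total = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total


def archive_directory(directory, archive_path):
    """Bundle a directory into one gzip-compressed tar file."""
    directory = Path(directory)
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(directory, arcname=directory.name)
    return Path(archive_path)


# Demonstration: one directory holding five small files.
with tempfile.TemporaryDirectory() as tmp:
    runs = Path(tmp) / "runs"
    runs.mkdir()
    for i in range(5):
        (runs / f"out_{i}.txt").write_text("result\n")
    demo_count = count_entries(tmp)  # the directory plus its five files
    archive = archive_directory(runs, Path(tmp) / "runs.tar.gz")
    with tarfile.open(archive) as tar:
        demo_members = len(tar.getnames())

print(demo_count, demo_members)
```

The originals only stop counting against the file limit once they are removed, so check that the archive is readable before deleting them.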

ARC /work and /bulk Group Allocation

Any group member who wants to use the shared storage should send an email to support@hpc.ucalgary.ca asking to be added to the access group, with the PI or data owner in CC. Copying the PI confirms that the PI approves the group member's request for access to the shared storage. Please note that the access permissions inside the directory are expected to be managed by the data owners.


All requests should answer the following questions:

  • How much storage is requested and why is that the amount that you need?

A rationale for a request can be a formal data management plan or something more informal, such as a rough estimate of the size of the primary dataset used for a project together with a rough estimate of the size of the outputs expected from the computations you plan to run on ARC over the next year.

  • What is the requested allocation name? (typically something like <PI name>_lab; for example, smith_lab)
  • What is the data classification using the University of Calgary data security classification system?
  • Which user or users would be the owner of the allocation? (Full Name and UCalgary Email address, typically the requesting PI but there may be co-PIs)
  • Which members of the allocation should be able to request access for new users? (Full Name and UCalgary Email address for active ARC users)
  • What is the faculty of the owner or owners?
  • Please provide a short description of the lab.
  • Please provide a brief numerical estimate of the required storage space based on projects that will use the allocation and their storage requirements.


Example 1: "We will be processing a 1TB dataset by performing 100 experimental runs. Each experiment will be processed to produce a 6GB output, giving 600GB of output data in total. We will also need 400GB of additional space for post-processing and data management. Thus, we would like to request 2TB of shared space in total."


Example 2: "3 members of our research group need additional shared space on ARC for their independent projects. Project 1 starts with 100GB of initial data and is expected to generate 800GB of output results. Project 2 is going to use simulations and does not use any input data but is expected to generate 2TB of simulated data for further processing. The processing will require 200GB of additional space. Project 3 will be working on a 1TB dataset and is expected to generate about 1TB of output data. These projects, therefore, will require 5.1TB of storage. For convenience of data manipulation and management we would also like to have an additional 400GB of storage space. Therefore, we would like to request 5.5TB of shared storage space in total."
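
The arithmetic behind both examples can be checked with a few lines of Python. This is only a sketch for preparing your own estimate. The variable names are made up for this illustration, and 1TB is treated as 1000GB, as the examples do.

```python
GB_PER_TB = 1000  # the examples above treat 1TB as 1000GB

# Example 1: one dataset, 100 runs of 6GB each, plus working space.
example_1_gb = 1 * GB_PER_TB + 100 * 6 + 400

# Example 2: three independent projects, plus extra working space.
example_2_projects_gb = {
    "project 1 initial data": 100,
    "project 1 output": 800,
    "project 2 simulated data": 2 * GB_PER_TB,
    "project 2 processing space": 200,
    "project 3 dataset": 1 * GB_PER_TB,
    "project 3 output": 1 * GB_PER_TB,
}
example_2_subtotal_gb = sum(example_2_projects_gb.values())
example_2_request_gb = example_2_subtotal_gb + 400

print(f"Example 1 request: {example_1_gb / GB_PER_TB:.1f} TB")
print(f"Example 2 projects: {example_2_subtotal_gb / GB_PER_TB:.1f} TB")
print(f"Example 2 request: {example_2_request_gb / GB_PER_TB:.1f} TB")
```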

Work and Bulk storage can be considerably larger than the home directory allocations. However, there are limits on what RCS can provide, because ARC storage provides high-speed access and is expensive to purchase. Typically, any request over 10TB will require some discussion.

Work and Bulk allocations differ in a few ways that influence how they are used. Work storage is faster to access from computational jobs on ARC, although the difference is small for jobs that do not involve enormous numbers of reads. Bulk storage is designed to be a target for instrument data (which is typically processed in a way that reads the data a small number of times per job) and can be mounted from instruments elsewhere on campus using SMB.

A number of questions come up frequently about Work and Bulk storage, and these are addressed in an FAQ.

Digital Research Alliance of Canada storage services

Storage on the Alliance HPC clusters

  • Alliance Wiki article "Storage and file management":
https://docs.alliancecan.ca/wiki/Storage_and_file_management

The Alliance NextCloud

For personal or Level 1 data, you may use an external solution from the Alliance. You must have an Alliance account to use the service. Its functionality is similar to Dropbox or Google Drive.

https://nextcloud.computecanada.ca
100 GB of storage that can be shared between your computers.
Alliance documentation: https://docs.alliancecan.ca/wiki/Nextcloud

National Data Management Infrastructure

  • The Alliance RDM information:
https://alliancecan.ca/en/services/research-data-management
  • Alliance notes on Research Data Management:
https://docs.alliancecan.ca/wiki/Research_Data_Management

Borealis Dataverse Repository

Borealis, the Canadian Dataverse Repository, is a bilingual, multidisciplinary, secure, Canadian research data repository, supported by academic libraries and research institutions across Canada. Borealis supports open discovery, management, sharing, and preservation of Canadian research data.

Federated Research Data Repository (FRDR)

The Federated Research Data Repository (FRDR) is a suitable solution for long-term archival storage of research datasets used in published research work. FRDR is a bilingual publishing platform for sharing and preserving Canadian research data. It is a curated, general-purpose repository, custom built for large datasets. FRDR is run by the Digital Research Alliance of Canada.

For more information on FRDR visit their web site: https://www.frdr-dfdr.ca/repo/

Commercial Cloud Based Storage Options

Amazon Web Services

Provided by Amazon Web Services, Inc.


AWS provides many different kinds of services, including storage services.

These options can be a solution for your research needs, but

  • they can be expensive, depending on your needs and the amount of data;
  • the pricing scheme is complex and can be confusing for new users;
  • the number of options can be overwhelming. Many of them are designed to provide pricing flexibility, not to increase functionality.


The key points:

  • Uploading data to AWS is free.
  • Storing data on AWS storage is a paid service.
  • Downloading data from AWS storage to your computer is a paid service.
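
The three key points translate into a simple cost model, sketched below in Python. The function name and the rates passed to it are placeholders chosen only to make the arithmetic easy to follow. They are not AWS prices. Real rates depend on the service, storage class, and region, and should be taken from the current AWS pricing pages.

```python
def estimate_monthly_cost(stored_gb, downloaded_gb,
                          storage_rate_per_gb, download_rate_per_gb):
    """Estimate one month of cost from the key points above.

    Uploads are free, so they do not appear in the formula. Both rates
    are supplied by the caller; no real prices are built in.
    """
    storage_cost = stored_gb * storage_rate_per_gb
    download_cost = downloaded_gb * download_rate_per_gb
    return storage_cost + download_cost


# Placeholder rates for illustration only (NOT actual AWS pricing).
cost = estimate_monthly_cost(
    stored_gb=1000, downloaded_gb=200,
    storage_rate_per_gb=0.5, download_rate_per_gb=0.25,
)
print(cost)
```

Even this crude model shows that data which is downloaded repeatedly adds a recurring charge on top of the storage charge.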