Submitted: 07 August 2023
Posted: 08 August 2023
Abstract

Keywords:
- Findings
- Background
- Our approach
- Testing
- Datasets
- File types
- Testing procedure
- Testing configuration
- Uploaded by command-line using any of the aws s3 transfer commands, which include the cp, sync, mv, and rm commands.
- Using the default values established for the following aws s3 configuration parameters:
  - max_concurrent_requests - default: 10.
  - max_queue_size - default: 1000.
  - multipart_threshold - default: 8 MB.
  - multipart_chunksize - default: 8 MB.
  - max_bandwidth - default: none.
  - use_accelerate_endpoint - default: false.
  - use_dualstack_endpoint - default: false.
  - addressing_style - default: auto.
  - payload_signing_enabled - default: false.
- Support
- Limitations
- Methods
- Main script
- Docker image
- [-v <path_local_folder>:<path_local_folder>]. Required argument. Replace both occurrences of <path_local_folder> with the absolute path to the local folder that contains the local version of the remote S3 files to be tested. This argument mounts the local folder as a volume in the Docker container, giving Docker read access to the local files to be tested. Important: the local folder must be referenced by its absolute path.
  - Example: -v /data/nucCyt:/data/nucCyt
- [-v "$PWD/logs/:/usr/src/logs"]. Required argument. Use this argument exactly as shown. It mounts the local logs folder as a volume in the Docker container, allowing Docker to record the outputs produced during the tool execution.
- [-v "$HOME/.aws:/root/.aws:ro"]. Required argument. Use this argument exactly as shown. It mounts the local folder that holds the user's AWS authentication information as a read-only volume in the Docker container, giving Docker read access to the user's AWS credentials.
- docker run -v /data/nucCyt:/data/nucCyt -v "$PWD/logs:/usr/src/logs" -v "$HOME/.aws:/root/.aws:ro" soniaruiz/aws-s3-integrity-check:latest -l /data/nucCyt/ -b nuccyt -p my_aws_profile
- docker run -v /data/nucCyt:/data/nucCyt -v "$PWD/logs:/usr/src/logs" -v "$HOME/.aws:/root/.aws:ro" soniaruiz/aws-s3-integrity-check:latest -l /data/nucCyt/ -b nuccyt
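Because the first volume argument must use an absolute path, a small wrapper can resolve the path before the command is assembled. The sketch below prints the `docker run` command as a dry run instead of executing it. The function name `build_docker_cmd` is hypothetical, and the meaning of the -l (local folder), -b (bucket) and -p (AWS profile) options is inferred from the two example commands above.

```bash
#!/usr/bin/env bash
# Sketch only: print the docker run command for a local folder, a bucket name
# and an optional AWS profile. Function name is hypothetical.

build_docker_cmd() {
  local local_dir=$1 bucket=$2 profile=${3:-}
  local abs_dir

  # Resolve relative paths, since the volume argument requires an absolute path.
  abs_dir=$(cd "$local_dir" && pwd) || return 1

  local cmd=(docker run
    -v "$abs_dir:$abs_dir"
    -v "$PWD/logs:/usr/src/logs"
    -v "$HOME/.aws:/root/.aws:ro"
    soniaruiz/aws-s3-integrity-check:latest
    -l "$abs_dir/" -b "$bucket")

  # The profile option is only added when a profile name is given.
  if [ -n "$profile" ]; then
    cmd+=(-p "$profile")
  fi

  # Print the command shell-quoted, so it can be reviewed before it is run.
  printf '%q ' "${cmd[@]}"
  printf '\n'
}
```

For example, `build_docker_cmd /data/nucCyt nuccyt my_aws_profile` prints a command equivalent to the first example above. The printed line can be pasted into a terminal once it has been checked.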
- Availability and requirements
- Project name: aws-s3-integrity-check: an open-source bash tool to verify the integrity of a dataset stored on Amazon S3
- Project homepage: https://github.com/SoniaRuiz/aws-s3-integrity-check, DOI: 10.5281/zenodo.8217517
- DockerHub URL: https://hub.docker.com/r/soniaruiz/aws-s3-integrity-check
- Protocols.io: https://www.protocols.io/view/check-the-integrity-of-a-dataset-stored-on-amazon-n92ld9qy9g5b/v2 (DOI: dx.doi.org/10.17504/protocols.io.n92ld9qy9g5b/v2)
- Operating system: Ubuntu 16.04.7 LTS (Xenial Xerus), Ubuntu 18.04.6 LTS (Bionic Beaver), Ubuntu server 22.04.1 LTS (Jammy Jellyfish).
- Programming language: Bash
- Other requirements:
  - jq (version jq-1.5-1-a5b5cbe, https://stedolan.github.io/jq/)
  - xxd (version 1.10 27oct98 by Juergen Weigert, https://manpages.ubuntu.com/manpages/bionic/en/man1/xxd.1.html)
  - s3md5 (https://github.com/antespi/s3md5)
  - AWS Command Line Interface (CLI) (version 2, https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html)
  - Docker (version 18.09.7, build 2d0083d, https://www.docker.com/)
- License: Apache-2.0 license
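Before running the tool outside Docker, it can help to confirm that the command-line requirements listed above are installed. The following is a minimal sketch with a hypothetical function name, `missing_tools`. It only checks that each command is found on the PATH and does not verify versions. s3md5 is left out because the way it is installed or located is not stated here.

```bash
#!/usr/bin/env bash
# Sketch only: report which of the given commands are not found on the PATH.
# Function name is hypothetical.

missing_tools() {
  local tool
  local missing=()

  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the command name.
    command -v "$tool" >/dev/null 2>&1 || missing+=("$tool")
  done

  # Print one missing tool per line and signal failure if any are absent.
  if [ "${#missing[@]}" -gt 0 ]; then
    printf '%s\n' "${missing[@]}"
    return 1
  fi
}
```

A typical call would be `missing_tools jq xxd aws docker md5sum`, which prints nothing and returns success when every command is available.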
- Availability of supporting data
Author Contributions
Acknowledgments
Competing Interests
Abbreviations
| Abbreviation | Definition |
|---|---|
| Amazon S3 | Amazon Simple Storage Service |
| API | Application Programming Interface |
| AWS | Amazon Web Services |
| AWS CLI | AWS Command Line Interface |
| DOI | Digital Object Identifier |
| EGA | European Genome-phenome Archive |
| ETag | Entity Tag |
| FTP | File Transfer Protocol |
| GB | Gigabytes |
| JSON | JavaScript Object Notation |
| MB | Megabytes |
| NGS | Next Generation Sequencing |
| SSE-C | Server-side encryption with customer-provided encryption keys |
| SSE-KMS | Server-side encryption with AWS Key Management Service keys |
| SSE-S3 | Server-side encryption with Amazon S3 managed keys |
| SSO | Single Sign-On |
| TB | Terabytes |
References
1. Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet. 2016, 17, 333–351.
2. Marx V. Method of the year: long-read sequencing. Nat Methods. 2023, 20, 6–11.
3. Angerer P, Simon L, Tritschler S, Wolf FA, Fischer D, Theis FJ. Single cells make big data: New challenges and opportunities in transcriptomics. Current Opinion in Systems Biology. 2017, 4, 85–91.
4. Schmidt B, Hildebrandt A. Next-generation sequencing: big data meets high performance computing. Drug Discov Today. 2017, 22, 712–717.
5. Fang S, Chen B, Zhang Y, Sun H, Liu L, Liu S, et al. Computational approaches and challenges in spatial transcriptomics. Genomics Proteomics Bioinformatics. 2022.
6. Cloud Computing Services - Amazon Web Services (AWS). https://aws.amazon.com/. Accessed 14 Apr 2023.
7. Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL. Searching for SNPs with cloud computing. Genome Biol. 2009, 10, R134.
8. Wall DP, Kudtarkar P, Fusaro VA, Pivovarov R, Patil P, Tonellato PJ. Cloud computing for comparative genomics. BMC Bioinformatics. 2010, 11, 259.
9. Halligan BD, Geiger JF, Vallejos AK, Greene AS, Twigger SN. Low cost, scalable proteomics data analysis using Amazon’s cloud computing services and open source search algorithms. J Proteome Res. 2009, 8, 3148–3153.
10. Dickens PM, Larson JW, Nicol DM. Diagnostics for causes of packet loss in a high performance data transfer system. In: 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings. IEEE; 2004. p. 55–64.
11. RFC 1864 - The Content-MD5 Header Field. https://datatracker.ietf.org/doc/html/rfc1864. Accessed 14 Apr 2023.
12. Checking object integrity - Amazon Simple Storage Service. https://docs.aws.amazon.com/AmazonS3/latest/userguide/checking-object-integrity.html. Accessed 31 Jul 2023.
13. AWS CLI S3 Configuration - AWS CLI 1.27.115 Command Reference. https://docs.aws.amazon.com/cli/latest/topic/s3-config.html. Accessed 19 Apr 2023.
14. antespi/s3md5: Bash script to calculate Etag/S3 MD5 sum for very big files uploaded using multipart S3 API. https://github.com/antespi/s3md5. Accessed 16 Apr 2023.
15. Freeberg MA, Fromont LA, D’Altri T, Romero AF, Ciges JI, Jene A, et al. The European Genome-phenome Archive in 2021. Nucleic Acids Res. 2022, 50, D980–7.
16. Sneddon TP, Zhe XS, Edmunds SC, Li P, Goodman L, Hunter CI. GigaDB: promoting data dissemination and reproducibility. Database (Oxford). 2014, 2014, bau018.
17. sync - AWS CLI 2.11.13 Command Reference. https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/sync.html. Accessed 16 Apr 2023.
18. GigaDB Dataset -. https://doi.org/10.5524/102374 - Supporting data for "Delineating Regions-of-interest for Mass Spectrometry Imaging by Multimodally C ... http://gigadb.org/dataset/102374. Accessed 12 May 2023.
19. Feleke R, Reynolds RH, Smith AM, Tilley B, Taliun SAG, Hardy J, et al. Cross-platform transcriptional profiling identifies common and distinct molecular pathologies in Lewy body diseases. Acta Neuropathol. 2021, 142, 449–474.
20. GigaDB Dataset -. https://doi.org/10.5524/102379 - Supporting data for "TF-Prioritizer: a java pipeline to prioritize condition-specific transcription ... http://gigadb.org/dataset/102379. Accessed 12 May 2023.
21. Guelfi S, D’Sa K, Botía JA, Vandrovcova J, Reynolds RH, Zhang D, et al. Regulatory sites for splicing in human basal ganglia are enriched for disease-relevant information. Nat Commun. 2020, 11, 1041.
22. time(1) - Linux manual page. https://man7.org/linux/man-pages/man1/time.1.html. Accessed 30 Jul 2023.
23. NumPy documentation - NumPy v1.25.dev0 Manual. https://numpy.org/devdocs/index.html. Accessed 12 May 2023.
24. IBM Documentation. https://www.ibm.com/docs/en/aix/7.1?topic=g-getopts-command. Accessed 16 Apr 2023.
25. ls - AWS CLI 2.11.13 Command Reference. https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/ls.html. Accessed 16 Apr 2023.
26. list-objects - AWS CLI 1.27.114 Command Reference. https://docs.aws.amazon.com/cli/latest/reference/s3api/list-objects.html. Accessed 16 Apr 2023.
27. md5sum invocation (GNU Coreutils 9.2). https://www.gnu.org/software/coreutils/manual/html_node/md5sum-invocation.html#md5sum-invocation. Accessed 14 Apr 2023.
28. Rivest R. The MD5 Message-Digest Algorithm. RFC Editor; 1992.
29. md5sum(1): compute/check MD5 message digest - Linux man page. https://linux.die.net/man/1/md5sum. Accessed 16 Apr 2023.
30. jq. https://stedolan.github.io/jq/. Accessed 16 Apr 2023.
31. Docker: Accelerated, Containerized Application Development. https://www.docker.com/. Accessed 16 Apr 2023.



| Amazon S3 Bucket | Data Origin | Details | Number of files tested | Bucket Size | Processing time | Log file |
|---|---|---|---|---|---|---|
| mass-spectrometry-imaging | GigaDB | Imaging-type supporting data for the publication "Delineating Regions-of-interest for Mass Spectrometry Imaging by Multimodally Corroborated Spatial Segmentation" [18]. | 36 | 16 GB | real 1m52.193s user 1m8.964s sys 0m24.404s | logs/mass-spectrometry-imaging.S3_integrity_log.2023.07.31-22.59.01.txt |
| rnaseq-pd | EGA | Contents of the EGA dataset EGAS00001006380, containing bulk-tissue RNA-sequencing paired nuclear and cytoplasmic fractions of the anterior prefrontal cortex, cerebellar cortex and putamen tissues from post-mortem neuropathologically-confirmed control individuals [19]. | 872 | 479 GB | real 62m56.793s user 36m26.604s sys 16m10.548s | logs/rnaseq-pd.S3_integrity_log.2023.07.31-23.02.47.txt |
| tf-prioritizer | GigaDB | Software-type supporting data for the publication "TF-Prioritizer: a java pipeline to prioritize condition-specific transcription factors" [20]. | 6 | 3.7 MB | real 0m15.131s user 0m2.012s sys 0m0.240s | logs/tf-prioritizer.S3_integrity_log.2023.07.31-22.58.33.txt |
| ukbec-unaligned-fastq | EGA | A subset of the EGA dataset EGAS00001003065, containing RNA-sequencing Fastq files generated from 180 putamen and substantia nigra control samples [21]. | 131 | 440 GB | real 51m12.058s user 31m27.348s sys 14m7.084s | logs/ukbec-unaligned-fastq.S3_integrity_log.2023.08.01-01.03.58.txt |

| File type | Description |
|---|---|
| Bam | Compressed binary version of a SAM file that is used to represent aligned sequences up to 128 Mb. |
| Bed | Browser Extensible Data (BED) format. This file format is used to store genomic regions as coordinates. |
| Csv | Comma-Separated Values (CSV). |
| Docx | File format for Microsoft Word documents. |
| Fa | File containing information about DNA sequences and other related pieces of scientific information. |
| Fastq | Text-based format for storing genome sequencing data and quality scores. |
| Gct | Gene Cluster Text (GCT). This is a tab-delimited text format file that contains gene expression data. |
| Gff | General Feature Format (GFF) is a file format used for describing genes and other features of DNA, RNA and protein sequences. |
| Gz | A file compressed by the standard GNU zip (gzip). |
| Html | HyperText Markup Language file. |
| Ibd | Pre-processed mass spectrometry imaging (MSI) data. |
| imzML | Imaging Mass Spectrometry Markup Language. Contains raw mass spectrometry imaging (MSI) data. |
| Ipynb | Computational notebooks that can be opened with Jupyter Notebook. |
| Jpg | Compressed image format for containing digital images. |
| JSON | JavaScript Object Notation. Text-based format to represent structured data based on JavaScript object syntax. |
| md5 | Checksum file. |
| Msa | Multiple sequence alignment file. It generally contains the alignment of three or more biological sequences of similar length. |
| Mtx | Sparse matrix format. This contains genes in the rows and cells in the columns. It is produced as output by Cell Ranger. |
| Npy | Standard binary file format in NumPy [23] for saving numpy arrays. |
| Nwk | Newick tree file format to represent graph-theoretical trees with edge lengths using parentheses and commas. |
| Pdf | Portable Document Format (PDF). |
| Py | Python file. |
| Pyc | Compiled bytecode file generated by the Python interpreter after a Python script is imported or executed. |
| R | R language script format. |
| Svg | Scalable Vector Graphics (SVG). This is a vector file format. |
| Tab | Tab-delimited text or data files. |
| Tif | Tag Image File Format. Tif is a computer file used to store raster graphics and image information. |
| Tsv | Tab-separated values (TSV) to store text-based tabular data. |
| Txt | Text document file. |
| Vcf | Variant Call Format. Text file for storing gene sequence variations. |
| Xls | Microsoft Excel Binary File format. |
| Zip | A file containing one or more compressed files. |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).