Crawling Module
Module: pyradise.fileio.crawling
General
The crawling module provides functionality to search for loadable data in a filesystem
hierarchy and to construct intermediate information (i.e. SeriesInfo and its
subclasses) that enables subject construction in the data loading process. The advantage of this intermediate
step is that unnecessary data can be skipped before the loading process instead of being loaded first (see
Selection Module). This is especially useful because working with imaging data is often memory intensive.
Furthermore, loading DICOM data may involve conversion steps (e.g. when registration files
or DICOM-RTSS are available) which are time consuming and can be avoided for data skipped at this intermediate step.
The crawling module provides separate crawlers for discrete image files and for DICOM data
because we assume that the user is working either with discrete image files or with DICOM data. For both input data
types, crawlers for single-subject directories and for
multi-subject datasets are provided. Single-subject directory crawlers require that the
data of the subject is contained in a single folder, which may include sub-folders. For multi-subject datasets each subject
must have its own directory at the top-most hierarchy level. The dataset crawlers offer an execute-at-once and an
iterative approach. The iterative approach is especially useful if the user wants to process the data
sequentially and keep the memory footprint low. The execute-at-once approach, on the other hand, may be useful
if the user wants to analyse the data in parallel.
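The difference between the two approaches can be sketched in plain Python. This is only an illustration of the pattern, not PyRaDiSe's implementation: the class `LazyCrawler`, the `scan` callable, and `fake_scan` are hypothetical names introduced for this example, and strings stand in for the series information entries.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


class LazyCrawler:
    """Hypothetical stand-in for a dataset crawler (not the PyRaDiSe class)."""

    def __init__(self, subject_dirs: Sequence[str],
                 scan: Callable[[str], Tuple[str, ...]]) -> None:
        self._subject_dirs = tuple(subject_dirs)
        self._scan = scan

    def __iter__(self) -> Iterator[Tuple[str, ...]]:
        # Iterative approach: one subject is scanned per iteration step.
        for subject_dir in self._subject_dirs:
            yield self._scan(subject_dir)

    def execute(self) -> Tuple[Tuple[str, ...], ...]:
        # Execute-at-once approach: all subjects are scanned and kept in memory.
        return tuple(self)


scanned: List[str] = []


def fake_scan(subject_dir: str) -> Tuple[str, ...]:
    # Records which subjects were touched so the laziness becomes visible.
    scanned.append(subject_dir)
    return (f"{subject_dir}/image.nii.gz",)


crawler = LazyCrawler(["subject_0", "subject_1", "subject_2"], fake_scan)

first = next(iter(crawler))          # only the first subject is scanned
scanned_after_first = list(scanned)

everything = crawler.execute()       # all three subjects are scanned at once
```

With the iterative approach only the subject currently being processed has to be held in memory, whereas `execute()` returns one inner tuple per subject for the whole dataset.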
Because not all information necessary for Subject creation is available in
the file content, the crawlers provide interfaces for information retrieval methods. The discrete image file
crawlers (recognizable by the word File in their name) provide interfaces for three types of
Extractors (i.e. ModalityExtractor,
OrganExtractor, and AnnotatorExtractor) which
the user needs to implement for the specific task. Typically, the Extractors
use parts of the file name to retrieve the necessary information (e.g. the modality from the file name or from a
lookup table). Because DICOM data is more structured than discrete file formats and the information
about the annotator and the organs can be read directly from the DICOM-RTSS, the DICOM crawlers only provide an interface for
retrieving the modality information of DICOM image data. This is essential when working with subject data consisting of
multiple images of the same modality because DICOM provides only minimal information about the imaging modality,
for example MR for all types of MR sequences. This minimal information may not be sufficient in many radiotherapy
applications because working with different MR sequences is common, and discriminating between them is
essential for feeding the MR sequences in the correct order through a processing pipeline or a DL model.
Thus, the user must have a mechanism to distinguish between the different uni-modal images. The DICOM crawlers provide
two separate mechanisms to retrieve detailed modality information. The first and prioritized approach uses a
modality configuration file (see modality_config for more details) which stores a persistent
mapping between the DICOM SeriesInstanceUID and its modality. The skeleton of this file can be generated automatically with
the appropriate crawler and then needs to be completed by the user. The second approach uses a user-defined
ModalityExtractor which extracts the necessary modality details directly from the
DICOM file content. Both approaches provide the same functionality but have distinct advantages. While the modality
configuration file approach may be more convenient for recurrent work on the same data, the extractor approach may be
better suited for building deployable solutions in which the modality details can be retrieved with rules. The
choice of approach is up to the user.
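The priority between the two mechanisms can be sketched with the standard library alone. Everything below is an assumption made for illustration: the flat JSON layout (SeriesInstanceUID as key, modality name as value) is not necessarily the format PyRaDiSe writes, the helpers `load_modality_config`, `resolve_modality`, and `rule_based_extractor` are hypothetical, the lookup is done per series for simplicity, the extractor works on a series description string rather than on a DICOM file, and `ValueError` is a placeholder for whatever exception the library raises.

```python
import json
import os
import tempfile
from typing import Callable, Dict, Optional

# Hypothetical rule signature: series description in, modality name (or None) out.
DescriptionRule = Callable[[str], Optional[str]]


def load_modality_config(path: str) -> Dict[str, str]:
    """Read a SeriesInstanceUID -> modality mapping; empty if the file is missing."""
    if not os.path.isfile(path):
        return {}
    with open(path, "r", encoding="utf-8") as handle:
        return dict(json.load(handle))


def resolve_modality(series_uid: str, description: str, config: Dict[str, str],
                     extractor: Optional[DescriptionRule] = None) -> str:
    # 1) The persistent mapping has priority.
    if series_uid in config:
        return config[series_uid]
    # 2) Otherwise fall back to the rule-based extractor, if one was supplied.
    if extractor is not None:
        modality = extractor(description)
        if modality is not None:
            return modality
    # 3) Nothing can name the modality (exception type is an assumption).
    raise ValueError(f"No modality available for series {series_uid}")


def rule_based_extractor(description: str) -> Optional[str]:
    text = description.lower()
    if "t1" in text:
        return "T1"
    if "t2" in text:
        return "T2"
    return None


with tempfile.TemporaryDirectory() as subject_dir:
    config_path = os.path.join(subject_dir, "modality_config.json")
    with open(config_path, "w", encoding="utf-8") as handle:
        json.dump({"1.2.840.1": "T1c"}, handle)  # made-up UID and modality
    config = load_modality_config(config_path)

# The mapping wins even though the rules would have answered "T1".
from_config = resolve_modality("1.2.840.1", "ax t1 post contrast", config, rule_based_extractor)
# No mapping entry for this series, so the rules decide.
from_rules = resolve_modality("1.2.840.2", "AX T2 FLAIR", config, rule_based_extractor)
```

The sketch also shows why the configuration file suits recurrent work on the same data: a manual correction such as `T1c` persists, while the rule-based path needs no per-dataset file and therefore travels more easily into a deployed pipeline.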
Data Structure & Hierarchy
Data Structure for Subject Crawlers
<subject_dir>
├── <file_0>
├── <file_1>
└── ...
Data Structure for Dataset Crawlers
<dataset_dir>
├── <subject_0>
│   ├── <file_0>
│   ├── <file_1>
│   └── ...
├── <subject_1>
│   ├── <file_0>
│   ├── <file_1>
│   └── ...
└── ...
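The dataset layout above implies that subjects are identified by the directories at the top level of the dataset directory. A minimal sketch of that idea follows; `list_subject_dirs` is a hypothetical helper, and skipping files that lie directly in the dataset directory is an assumption of this sketch rather than documented crawler behaviour.

```python
import os
import tempfile
from typing import Tuple


def list_subject_dirs(dataset_dir: str) -> Tuple[str, ...]:
    """Return the names of all top-level directories, one per subject."""
    names = []
    for name in os.listdir(dataset_dir):
        # Only directories count as subjects; stray top-level files are skipped.
        if os.path.isdir(os.path.join(dataset_dir, name)):
            names.append(name)
    return tuple(sorted(names))


# Build a throwaway dataset that mirrors the layout shown above.
with tempfile.TemporaryDirectory() as dataset_dir:
    for subject in ("subject_0", "subject_1"):
        os.makedirs(os.path.join(dataset_dir, subject))
        open(os.path.join(dataset_dir, subject, "file_0.nii.gz"), "w").close()
    open(os.path.join(dataset_dir, "stray_file.nii.gz"), "w").close()

    subjects = list_subject_dirs(dataset_dir)
```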
Class Overview
The following Crawler classes are provided by the crawling module:
| Class | Description |
|---|---|
| Crawler | Base class for all crawlers |
| SubjectFileCrawler | Crawler class for discrete image files in a single subject directory |
| DatasetFileCrawler | Crawler class for discrete image files in a dataset directory |
| SubjectDicomCrawler | Crawler class for DICOM files in a single subject directory |
| DatasetDicomCrawler | Crawler class for DICOM files in a dataset directory |
Details
- class Crawler(path)
Bases: ABC
An abstract crawler whose subtypes are intended to be used for searching files of a certain type in a specified location or within a hierarchy of directories.
- Parameters:
path (str) – The directory path for which the crawling will be performed.
- class SubjectFileCrawler(path, subject_name, extension, modality_extractor, organ_extractor, annotator_extractor)
Bases: Crawler
A crawler for retrieving FileSeriesInfo entries from a subject directory containing discrete image files of a specified type (see extension parameter).
The SubjectFileCrawler is used for searching appropriate files within a specific subject directory containing all the subject’s data. If multiple subjects are stored in separate directories within a common top-level directory, we recommend using the DatasetFileCrawler.
Important
The DICOM format is not supported by this crawler. Use the appropriate crawler variant instead.
- Raises:
ValueError – If the extension parameter specifies the DICOM file extension (i.e. .dcm).
- Parameters:
path (str) – The directory path to crawl for files.
subject_name (str) – The name of the subject.
extension (str) – The file extension of the files to be searched.
modality_extractor (ModalityExtractor) – The modality extractor.
organ_extractor (OrganExtractor) – The organ extractor.
annotator_extractor (AnnotatorExtractor) – The annotator extractor.
- execute()
Execute the crawling process.
- Returns:
The crawled data.
- Return type:
Tuple[FileSeriesInfo, …]
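The file search performed for a subject directory, including the rejection of the DICOM extension described above, might look roughly like the following. This is a sketch under stated assumptions and not the library's code: `find_files` is a hypothetical helper, it returns plain paths instead of FileSeriesInfo entries, and the extension normalization is an illustrative choice.

```python
import os
import tempfile
from typing import Tuple


def find_files(subject_dir: str, extension: str) -> Tuple[str, ...]:
    """Collect all files below subject_dir that end with the given extension."""
    # DICOM data is handled by the DICOM crawlers, so reject it early.
    if extension.lower().lstrip(".") == "dcm":
        raise ValueError("DICOM files are not supported by file crawlers.")

    # Normalize so that 'nii.gz' and '.nii.gz' behave the same way.
    suffix = "." + extension.lstrip(".")

    matches = []
    # os.walk also descends into sub-folders of the subject directory.
    for root, _dirs, file_names in os.walk(subject_dir):
        for file_name in file_names:
            if file_name.endswith(suffix):
                matches.append(os.path.join(root, file_name))
    return tuple(sorted(matches))


with tempfile.TemporaryDirectory() as subject_dir:
    os.makedirs(os.path.join(subject_dir, "seg"))
    for rel_path in ("t1.nii.gz", os.path.join("seg", "tumor.nii.gz"), "notes.txt"):
        open(os.path.join(subject_dir, rel_path), "w").close()
    found = [os.path.basename(p) for p in find_files(subject_dir, ".nii.gz")]

try:
    find_files(".", ".dcm")
    rejected = False
except ValueError:
    rejected = True
```

In the real crawler each matching path would additionally be passed to the modality, organ, and annotator extractors to build the series information.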
- class DatasetFileCrawler(path, extension, modality_extractor, organ_extractor, annotator_extractor)
Bases: Crawler
An iterable crawler for retrieving FileSeriesInfo entries from a dataset directory containing at least one subject directory with image files of a specified type (see extension parameter).
If you want to load a large dataset with many subjects, we recommend using the iterative crawling approach instead of crawling the data via execute() to reduce memory consumption (see example below).
Important
The DICOM format is not supported by this crawler. Use the appropriate crawler variant instead.
Example
Demonstration of the iterative and the non-iterative loading approach:
>>> import os
>>> from typing import Optional
>>>
>>> from pyradise.data import (Modality, Organ, Annotator)
>>> from pyradise.fileio import (DatasetFileCrawler, ModalityExtractor,
>>>                              OrganExtractor, AnnotatorExtractor, SubjectLoader)
>>>
>>>
>>> # An example modality extractor
>>> class MyModalityExtractor(ModalityExtractor):
>>>
>>>     def extract_from_dicom(self, path: str) -> Optional[Modality]:
>>>         return None
>>>
>>>     def extract_from_path(self, path: str) -> Optional[Modality]:
>>>         file_name = os.path.basename(path)
>>>         if 't1' in file_name:
>>>             return Modality('T1')
>>>         elif 't2' in file_name:
>>>             return Modality('T2')
>>>         else:
>>>             return None
>>>
>>>
>>> # An example organ extractor
>>> class MyOrganExtractor(OrganExtractor):
>>>
>>>     def extract(self, path: str) -> Optional[Organ]:
>>>         file_name = os.path.basename(path).lower()
>>>         if 'brainstem' in file_name:
>>>             return Organ('Brainstem')
>>>         elif 'tumor' in file_name:
>>>             return Organ('Tumor')
>>>         else:
>>>             return None
>>>
>>>
>>> # An example annotator extractor
>>> class MyAnnotatorExtractor(AnnotatorExtractor):
>>>
>>>     def extract(self, path: str) -> Optional[Annotator]:
>>>         file_name = os.path.basename(path).lower()
>>>         if 'example_expert' in file_name:
>>>             return Annotator('ExampleExpert')
>>>         return None
>>>
>>>
>>> def main_iterative_crawling(dataset_path: str) -> None:
>>>     extension = '.nii.gz'
>>>
>>>     # Create the crawler
>>>     crawler = DatasetFileCrawler(dataset_path, extension, MyModalityExtractor(),
>>>                                  MyOrganExtractor(), MyAnnotatorExtractor())
>>>
>>>     # Use the crawler iteratively (more memory efficient)
>>>     for series_info in crawler:
>>>         subject = SubjectLoader().load(series_info)
>>>         # Do something with the subject
>>>         print(subject.get_name())
>>>
>>>
>>> def main_crawling_using_execute_fn(dataset_path: str) -> None:
>>>     extension = '.nii.gz'
>>>
>>>     # Create the crawler
>>>     crawler = DatasetFileCrawler(dataset_path, extension, MyModalityExtractor(),
>>>                                  MyOrganExtractor(), MyAnnotatorExtractor())
>>>
>>>     # Use the crawler with the execute function
>>>     # (all series info entries are loaded in one step)
>>>     series_infos = crawler.execute()
>>>
>>>     # Iterate over the series infos
>>>     for series_info in series_infos:
>>>         subject = SubjectLoader().load(series_info)
>>>         # Do something with the subject
>>>         print(subject.get_name())
- Raises:
ValueError – If the extension parameter specifies the DICOM file extension (i.e. .dcm).
- Parameters:
path (str) – The dataset directory path to crawl for data.
extension (str) – The file extension of the image files to be crawled.
modality_extractor (ModalityExtractor) – The modality extractor.
organ_extractor (OrganExtractor) – The organ extractor.
annotator_extractor (AnnotatorExtractor) – The annotator extractor.
- execute()
Execute the crawling process.
- Returns:
The crawled data.
- Return type:
Tuple[Tuple[FileSeriesInfo, …], …]
- class SubjectDicomCrawler(path, modality_extractor=None, modality_config_file_name='modality_config.json', write_modality_config=False)
Bases: Crawler
A crawler for retrieving DicomSeriesInfo entries from a subject directory containing DICOM files (e.g. DICOM images, DICOM registrations, DICOM RTSS). Files of formats other than DICOM will be ignored and cannot be crawled with this type of crawler.
The SubjectDicomCrawler is used for searching appropriate files within a specific subject directory containing all the subject’s data. If multiple subjects are stored in separate directories within a common top-level directory, we recommend using the DatasetDicomCrawler.
The prioritized method to extract the Modality for the retrieved images is the usage of a modality configuration file. If no modality configuration file is available, the SubjectDicomCrawler will try to extract the Modality from the retrieved images using the ModalityExtractor. If no ModalityExtractor is provided, an exception will be raised.
The SubjectDicomCrawler can be used to generate the modality configuration file skeleton for a specific subject. In this case set the write_modality_config parameter to True and execute the crawling process. The generated modality configuration file skeleton will be saved in the subject directory.
Important
This crawler exclusively supports the DICOM file format and does not support any other file format.
- Parameters:
path (str) – The subject directory path to crawl.
modality_extractor (Optional[ModalityExtractor]) – The modality extractor (default: None).
modality_config_file_name (str) – The file name for the modality configuration file within the subject directory (default: modality_config.json).
write_modality_config (bool) – If True writes the modality configuration retrieved to the subject directory (default: False).
- execute()
Execute the crawling process to retrieve the DicomSeriesInfo entries.
- Returns:
The retrieved DicomSeriesInfo entries.
- Return type:
Tuple[DicomSeriesInfo, …]
- class DatasetDicomCrawler(path, modality_extractor=None, modality_config_file_name='modality_config.json', write_modality_config=False)
Bases: Crawler
A crawler for retrieving DicomSeriesInfo entries from a dataset directory containing at least one subject directory with DICOM files (e.g. DICOM images, DICOM registrations, DICOM RTSS). Files of formats other than DICOM will be ignored and cannot be crawled with this type of crawler.
The DatasetDicomCrawler is used for searching appropriate files within a specific dataset directory containing at least one subject folder with DICOM files. If there is just one subject in a single directory to be crawled, we recommend using the SubjectDicomCrawler. If you want to load a large dataset with many subjects, we recommend using the iterative crawling approach instead of crawling the data via execute() to reduce memory consumption (see example below).
The prioritized method to extract the Modality for the retrieved images is the usage of a modality configuration file. If no modality configuration file is available for a specific subject directory, the DatasetDicomCrawler will try to extract the Modality from the retrieved subject images using the ModalityExtractor. If no ModalityExtractor is provided, an exception will be raised.
The DatasetDicomCrawler can be used to generate the modality configuration file skeletons for all subjects in the dataset directory. In this case set the write_modality_config parameter to True and execute the crawling process. The generated modality configuration file skeletons will be saved in the appropriate subject directories.
Important
This crawler exclusively supports the DICOM file format and does not support any other file format.
Example
Demonstration of the iterative and the non-iterative loading approach:
>>> from pyradise.fileio import (DatasetDicomCrawler, SubjectLoader)
>>>
>>>
>>> def main_iterative_crawling(dataset_path: str) -> None:
>>>     # Create the crawler (using the modality configuration file)
>>>     crawler = DatasetDicomCrawler(dataset_path)
>>>
>>>     # Use the crawler iteratively (more memory efficient)
>>>     for series_info in crawler:
>>>         subject = SubjectLoader().load(series_info)
>>>         # Do something with the subject
>>>         print(subject.get_name())
>>>
>>>
>>> def main_crawling_using_execute_fn(dataset_path: str) -> None:
>>>     # Create the crawler (using the modality configuration file)
>>>     crawler = DatasetDicomCrawler(dataset_path)
>>>
>>>     # Use the crawler with the execute function
>>>     # (all series info entries are loaded in one step)
>>>     series_infos = crawler.execute()
>>>
>>>     # Iterate over the series infos
>>>     for series_info in series_infos:
>>>         subject = SubjectLoader().load(series_info)
>>>         # Do something with the subject
>>>         print(subject.get_name())
- Parameters:
path (str) – The dataset directory path to crawl.
modality_extractor (Optional[ModalityExtractor]) – The modality extractor (default: None).
modality_config_file_name (str) – The file name for the modality configuration file within the subject directory (default: modality_config.json).
write_modality_config (bool) – If True writes the modality configuration retrieved to the subject directory (default: False).
- execute()
Execute the crawling process to retrieve the DicomSeriesInfo entries.
- Returns:
The retrieved DicomSeriesInfo entries.
- Return type:
Tuple[Tuple[DicomSeriesInfo, …], …]