Crawling Module

Module: pyradise.fileio.crawling

General

The crawling module provides functionality to search for loadable data in a filesystem hierarchy and to construct intermediate information (i.e. SeriesInfo and its subclasses) that enables subject construction in the data loading process. The advantage of this intermediate step is that unnecessary data can be skipped before the loading process (see Selection Module). This is especially useful because working with imaging data is often memory intensive. Furthermore, loading DICOM data may involve time-consuming conversion steps (e.g. when registration files or DICOM-RTSS are available), which can be omitted using this intermediate step.

The crawling module provides separate crawlers for discrete image files and for DICOM data because we assume that the user works with either one or the other. For both input data types, crawlers for single-subject directories and for multi-subject datasets are provided. Single-subject crawlers require that all data of the subject is contained in a single folder, which may include sub-folders. For multi-subject datasets, each subject must have its own directory at the top-most hierarchy level. The dataset crawlers offer both an execute-at-once and an iterative approach. The iterative approach is especially useful if the user wants to process the data sequentially and keep the memory footprint low. The execute-at-once approach, on the other hand, may be useful if the user wants to analyse the data in parallel.
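The difference between the two dataset crawling approaches can be illustrated with a small, library-independent sketch. It does not use pyradise and is not how the crawlers are implemented; the in-memory DATASET dictionary and both function names are assumptions made purely for illustration.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical stand-in for a dataset on disk: subject name -> file names.
DATASET: Dict[str, Tuple[str, ...]] = {
    "subject_0": ("t1.nii.gz", "t2.nii.gz"),
    "subject_1": ("t1.nii.gz",),
}

SubjectEntry = Tuple[str, Tuple[str, ...]]


def crawl_iteratively(dataset: Dict[str, Tuple[str, ...]]) -> Iterator[SubjectEntry]:
    """Yield one subject at a time so that only one entry is held at once."""
    for subject_name in sorted(dataset):
        yield subject_name, dataset[subject_name]


def crawl_at_once(dataset: Dict[str, Tuple[str, ...]]) -> Tuple[SubjectEntry, ...]:
    """Collect all subjects up front, e.g. to distribute them to workers."""
    return tuple(crawl_iteratively(dataset))


# Iterative: each subject can be processed and discarded before the next one.
for name, files in crawl_iteratively(DATASET):
    print(name, len(files))

# Execute-at-once: the complete result is available for parallel analysis.
all_subjects = crawl_at_once(DATASET)
```

The iterative variant trades random access for a small memory footprint, whereas the execute-at-once variant materializes everything so that the entries can be split across processes.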

Because not all information necessary for Subject creation is available in the file content, the crawlers provide interfaces for information retrieval methods. The discrete image file crawlers (recognizable by the word File in their name) provide interfaces for three types of Extractors (i.e. ModalityExtractor, OrganExtractor, and AnnotatorExtractor), which the user needs to implement for the specific task. Typically, the Extractors use parts of the file name to retrieve the necessary information (e.g. the modality from the file name or from a lookup table).

Because DICOM data is more structured than discrete file formats, and the information about the annotator and the organs can be accessed directly in the DICOM-RTSS, the DICOM crawlers provide interfaces only for retrieving the modality information of DICOM image data. This is essential when working with subject data consisting of multiple images from the same modality because DICOM provides just minimal information about the imaging modality, for example MR for all types of MR-sequences. This minimal information may not be sufficient in many radiotherapy applications because working with different MR-sequences is common, and discriminating between them is essential to feed the MR-sequences in the correct order through a processing pipeline or a DL-model. Thus, the user must have a mechanism to distinguish between the different uni-modal images.

The DICOM crawlers provide two separate mechanisms to retrieve detailed modality information. The first and prioritized approach uses a modality configuration file (see modality_config for more details), which stores a persistent mapping between the DICOM SeriesInstanceUID and its modality. The skeleton of this file can be generated automatically with the appropriate crawler and needs to be modified accordingly by the user. The second approach uses a user-defined ModalityExtractor, which extracts the necessary modality details directly from the DICOM file content. Both approaches provide the same functionality but have distinct advantages. While the modality configuration file approach may be more convenient for recurrent work on the same data, the extractor approach may be better suited for building deployable solutions for which the modality details can be retrieved rule-based. The selection of the appropriate approach is up to the user.
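The priority between the two mechanisms can be sketched as follows. This is a simplified illustration and not pyradise code: resolve_modality, rule_based_extractor, the plain dictionary standing in for the configuration file, the made-up UIDs, and the use of a series description as extractor input are all assumptions. In particular, falling back to the extractor when a UID is missing from the mapping is an assumption of this sketch.

```python
from typing import Callable, Dict, Optional


def resolve_modality(
    series_instance_uid: str,
    series_description: str,
    modality_config: Optional[Dict[str, str]] = None,
    extractor: Optional[Callable[[str], Optional[str]]] = None,
) -> str:
    """Resolve a detailed modality name for one DICOM image series."""
    # 1) Prioritized: persistent mapping SeriesInstanceUID -> modality.
    if modality_config is not None and series_instance_uid in modality_config:
        return modality_config[series_instance_uid]
    # 2) Fallback: rule-based extraction (here from the series description).
    if extractor is not None:
        modality = extractor(series_description)
        if modality is not None:
            return modality
    raise ValueError(f"No modality found for series {series_instance_uid}.")


def rule_based_extractor(series_description: str) -> Optional[str]:
    """Hypothetical rule set distinguishing two MR-sequences."""
    text = series_description.lower()
    if "t1" in text:
        return "T1"
    if "t2" in text:
        return "T2"
    return None


# Made-up UIDs for illustration only.
config = {"1.2.3.4": "T1c"}
from_config = resolve_modality("1.2.3.4", "ax t1 post", config, rule_based_extractor)
from_rules = resolve_modality("1.2.3.5", "ax t2", config, rule_based_extractor)
```

Here the first series is resolved through the mapping even though the rules would also match, while the second series is not listed and is therefore resolved by the rules.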

Data Structure & Hierarchy

Data Structure for Subject Crawlers

<subject_dir>
├── <file_0>
├── <file_1>
└── ...

Data Structure for Dataset Crawlers

<dataset_dir>
├── <subject_0>
│   ├── <file_0>
│   ├── <file_1>
│   └── ...
├── <subject_1>
│   ├── <file_0>
│   ├── <file_1>
│   └── ...
└── ...
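To check whether a directory follows the dataset layout above, the top-level subject directories and their files can be listed with the standard library. The following is a minimal sketch that does not use pyradise; list_subject_files and build_demo_dataset are hypothetical helper names, and the demo dataset is created in a temporary directory.

```python
import tempfile
from pathlib import Path
from typing import Dict, Tuple


def list_subject_files(dataset_dir: Path) -> Dict[str, Tuple[str, ...]]:
    """Map each top-level subject directory to the files found below it."""
    subjects: Dict[str, Tuple[str, ...]] = {}
    for subject_dir in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
        files = sorted(
            f.relative_to(subject_dir).as_posix()
            for f in subject_dir.rglob("*")
            if f.is_file()
        )
        subjects[subject_dir.name] = tuple(files)
    return subjects


def build_demo_dataset(root: Path) -> None:
    """Create a tiny dataset with one directory per subject."""
    demo = {"subject_0": ["file_0", "file_1"], "subject_1": ["file_0"]}
    for subject, file_names in demo.items():
        (root / subject).mkdir()
        for file_name in file_names:
            (root / subject / file_name).touch()


with tempfile.TemporaryDirectory() as tmp:
    build_demo_dataset(Path(tmp))
    layout = list_subject_files(Path(tmp))

print(layout)
```

Files placed directly in the dataset directory are ignored by this sketch because every subject is expected to have its own directory at the top-most level.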

Class Overview

The following Crawler classes are provided by the crawling module:

Class                  Description
---------------------  --------------------------------------------------------------------
Crawler                Base class for all Crawler subclasses
SubjectFileCrawler     Crawler class for discrete image files in a single subject directory
DatasetFileCrawler     Crawler class for discrete image files in a dataset directory
SubjectDicomCrawler    Crawler class for DICOM files in a single subject directory
DatasetDicomCrawler    Crawler class for DICOM files in a dataset directory

Details

class Crawler(path)

Bases: ABC

An abstract crawler whose subtypes are intended to be used for searching files of a certain type in a specified location or within a hierarchy of directories.

Parameters:

path (str) – The directory path for which the crawling will be performed.

abstract execute()

Execute the crawling process.

Returns:

The crawled data.

Return type:

Any

class SubjectFileCrawler(path, subject_name, extension, modality_extractor, organ_extractor, annotator_extractor)

Bases: Crawler

A crawler for retrieving FileSeriesInfo entries from a subject directory containing discrete image files of a specified type (see extension parameter).

The SubjectFileCrawler is used for searching appropriate files within a specific subject directory containing all the subject’s data. If multiple subjects in separate directories within a common top-level directory are to be crawled, we recommend using the DatasetFileCrawler.

Important

The DICOM format is not supported by this crawler. Use the appropriate crawler variant instead.

Raises:

ValueError – If the extension parameter specifies the DICOM file extension (i.e. .dcm).

Parameters:
  • path (str) – The directory path to crawl for files.

  • subject_name (str) – The name of the subject.

  • extension (str) – The file extension of the files to be searched.

  • modality_extractor (ModalityExtractor) – The modality extractor.

  • organ_extractor (OrganExtractor) – The organ extractor.

  • annotator_extractor (AnnotatorExtractor) – The annotator extractor.

execute()

Execute the crawling process.

Returns:

The crawled data.

Return type:

Tuple[FileSeriesInfo, …]
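The behavior described for the SubjectFileCrawler (searching a subject directory for files with a given extension, rejecting the DICOM extension, and delegating information retrieval to extractors) can be approximated with the standard library. This is a conceptual sketch only, not the library's implementation: FileEntry is a hypothetical, simplified stand-in for FileSeriesInfo, and crawl_subject_dir and modality_from_path are names introduced here for illustration.

```python
import os
import tempfile
from typing import Callable, List, NamedTuple, Optional, Tuple


class FileEntry(NamedTuple):
    """Hypothetical, simplified stand-in for a FileSeriesInfo entry."""
    path: str
    subject_name: str
    modality: Optional[str]


def crawl_subject_dir(
    path: str,
    subject_name: str,
    extension: str,
    modality_from_path: Callable[[str], Optional[str]],
) -> Tuple[FileEntry, ...]:
    """Collect all files with the given extension below a subject directory."""
    if extension.lower().endswith(".dcm"):
        raise ValueError("DICOM files are not supported by this sketch.")
    entries: List[FileEntry] = []
    for root, dirs, file_names in os.walk(path):
        dirs.sort()  # Visit sub-folders in a deterministic order.
        for file_name in sorted(file_names):
            if file_name.endswith(extension):
                file_path = os.path.join(root, file_name)
                modality = modality_from_path(file_path)
                entries.append(FileEntry(file_path, subject_name, modality))
    return tuple(entries)


def modality_from_path(path: str) -> Optional[str]:
    """Derive the modality from the file name (assumed naming scheme)."""
    name = os.path.basename(path).lower()
    if "t1" in name:
        return "T1"
    if "t2" in name:
        return "T2"
    return None


with tempfile.TemporaryDirectory() as tmp:
    for name in ("img_t1.nii.gz", "img_t2.nii.gz", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    entries = crawl_subject_dir(tmp, "subject_0", ".nii.gz", modality_from_path)
```

Only lightweight descriptions of the files are created in this step, which mirrors the idea that the image data itself is loaded later and only for the entries that are actually needed.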

class DatasetFileCrawler(path, extension, modality_extractor, organ_extractor, annotator_extractor)

Bases: Crawler

An iterable crawler for retrieving FileSeriesInfo entries from a dataset directory containing at least one subject directory with image files of a specified type (see extension parameter).

If you want to load a large dataset with many subjects, we recommend using the iterative crawling approach instead of crawling the data via execute() to reduce memory consumption (see example below).

Important

The DICOM format is not supported by this crawler. Use the appropriate crawler variant instead.

Example

Demonstration of the iterative and the non-iterative loading approach:

>>> import os
>>> from typing import Optional
>>>
>>> from pyradise.data import (Modality, Organ, Annotator)
>>> from pyradise.fileio import (DatasetFileCrawler, ModalityExtractor,
>>>                              OrganExtractor, AnnotatorExtractor, SubjectLoader)
>>>
>>>
>>> # An example modality extractor
>>> class MyModalityExtractor(ModalityExtractor):
>>>
>>>     def extract_from_dicom(self, path: str) -> Optional[Modality]:
>>>         return None
>>>
>>>     def extract_from_path(self, path: str) -> Optional[Modality]:
>>>         file_name = os.path.basename(path).lower()
>>>         if 't1' in file_name:
>>>             return Modality('T1')
>>>         elif 't2' in file_name:
>>>             return Modality('T2')
>>>         else:
>>>             return None
>>>
>>>
>>> # An example organ extractor
>>> class MyOrganExtractor(OrganExtractor):
>>>
>>>     def extract(self, path: str) -> Optional[Organ]:
>>>         file_name = os.path.basename(path).lower()
>>>         if 'brainstem' in file_name:
>>>             return Organ('Brainstem')
>>>         elif 'tumor' in file_name:
>>>             return Organ('Tumor')
>>>         else:
>>>             return None
>>>
>>>
>>> # An example annotator extractor
>>> class MyAnnotatorExtractor(AnnotatorExtractor):
>>>
>>>     def extract(self, path: str) -> Optional[Annotator]:
>>>         file_name = os.path.basename(path).lower()
>>>         if 'example_expert' in file_name:
>>>             return Annotator('ExampleExpert')
>>>         return None
>>>
>>>
>>> def main_iterative_crawling(dataset_path: str) -> None:
>>>     extension = '.nii.gz'
>>>
>>>     # Create the crawler
>>>     crawler = DatasetFileCrawler(dataset_path, extension, MyModalityExtractor(),
>>>                                  MyOrganExtractor(), MyAnnotatorExtractor())
>>>
>>>     # Use the crawler iteratively (more memory efficient)
>>>     for series_info in crawler:
>>>         subject = SubjectLoader().load(series_info)
>>>         # Do something with the subject
>>>         print(subject.get_name())
>>>
>>>
>>> def main_crawling_using_execute_fn(dataset_path: str) -> None:
>>>     extension = '.nii.gz'
>>>
>>>     # Create the crawler
>>>     crawler = DatasetFileCrawler(dataset_path, extension, MyModalityExtractor(),
>>>                                  MyOrganExtractor(), MyAnnotatorExtractor())
>>>
>>>     # Use the crawler with the execute function
>>>     # (all series info entries are loaded in one step)
>>>     series_infos = crawler.execute()
>>>
>>>     # Iterate over the series infos
>>>     for series_info in series_infos:
>>>         subject = SubjectLoader().load(series_info)
>>>         # Do something with the subject
>>>         print(subject.get_name())

Raises:

ValueError – If the extension parameter specifies the DICOM file extension (i.e. .dcm).

Parameters:
  • path (str) – The dataset directory path to crawl for data.

  • extension (str) – The file extension of the image files to be crawled.

  • modality_extractor (ModalityExtractor) – The modality extractor.

  • organ_extractor (OrganExtractor) – The organ extractor.

  • annotator_extractor (AnnotatorExtractor) – The annotator extractor.

execute()

Execute the crawling process.

Returns:

The crawled data.

Return type:

Tuple[Tuple[FileSeriesInfo, …], …]

class SubjectDicomCrawler(path, modality_extractor=None, modality_config_file_name='modality_config.json', write_modality_config=False)

Bases: Crawler

A crawler for retrieving DicomSeriesInfo entries from a subject directory containing DICOM files (e.g. DICOM images, DICOM registrations, DICOM RTSS). Files of formats other than DICOM will be ignored and cannot be crawled with this type of crawler.

The SubjectDicomCrawler is used for searching appropriate files within a specific subject directory containing all the subject’s data. If multiple subjects in separate directories within a common top-level directory are to be crawled, we recommend using the DatasetDicomCrawler.

The prioritized method to extract the Modality for the retrieved images is the usage of a modality configuration file. If no modality configuration file is available, the SubjectDicomCrawler will try to extract the Modality from the retrieved images using the ModalityExtractor. If no ModalityExtractor is provided, an exception will be raised.

The SubjectDicomCrawler can be used to generate the modality configuration file skeleton for a specific subject. In this case, set the write_modality_config parameter to True and execute the crawling process. The generated modality configuration file skeleton will be saved in the subject directory.
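The skeleton workflow (generate the file, edit it by hand, reuse it on later runs) can be illustrated with plain JSON handling. The sketch below does not call pyradise, and the file schema shown here (a list of entries with a SeriesInstanceUID and a Modality field) is an assumption that may differ from the schema the library actually writes; write_config_skeleton, load_config, and the UIDs are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, Sequence


def write_config_skeleton(
    subject_dir: str,
    series_instance_uids: Sequence[str],
    file_name: str = "modality_config.json",
) -> str:
    """Write a skeleton with a placeholder modality for every series."""
    skeleton = [
        {"SeriesInstanceUID": uid, "Modality": "UNKNOWN"}
        for uid in series_instance_uids
    ]
    config_path = os.path.join(subject_dir, file_name)
    with open(config_path, "w", encoding="utf-8") as file:
        json.dump(skeleton, file, indent=2)
    return config_path


def load_config(config_path: str) -> Dict[str, str]:
    """Load the persistent mapping SeriesInstanceUID -> modality."""
    with open(config_path, "r", encoding="utf-8") as file:
        return {e["SeriesInstanceUID"]: e["Modality"] for e in json.load(file)}


with tempfile.TemporaryDirectory() as subject_dir:
    # Step 1: generate the skeleton (made-up UIDs for illustration).
    path = write_config_skeleton(subject_dir, ["1.2.3.4", "1.2.3.5"])
    skeleton_mapping = load_config(path)

    # Step 2: simulate the manual edit that replaces the placeholders.
    edited = [
        {"SeriesInstanceUID": "1.2.3.4", "Modality": "T1"},
        {"SeriesInstanceUID": "1.2.3.5", "Modality": "T2"},
    ]
    with open(path, "w", encoding="utf-8") as file:
        json.dump(edited, file, indent=2)

    # Step 3: later runs read the persistent mapping.
    final_mapping = load_config(path)
```

Keeping the mapping in a file next to the subject data makes repeated work on the same dataset reproducible without re-deriving the modalities each time.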

Important

This crawler exclusively supports the DICOM file format and does not support any other file format.

Parameters:
  • path (str) – The subject directory path to crawl.

  • modality_extractor (Optional[ModalityExtractor]) – The modality extractor (default: None).

  • modality_config_file_name (str) – The file name for the modality configuration file within the subject directory (default: modality_config.json).

  • write_modality_config (bool) – If True, writes the retrieved modality configuration to the subject directory (default: False).

execute()

Execute the crawling process to retrieve the DicomSeriesInfo entries.

Returns:

The retrieved DicomSeriesInfo entries.

Return type:

Tuple[DicomSeriesInfo, …]

class DatasetDicomCrawler(path, modality_extractor=None, modality_config_file_name='modality_config.json', write_modality_config=False)

Bases: Crawler

A crawler for retrieving DicomSeriesInfo entries from a dataset directory containing at least one subject directory with DICOM files (e.g. DICOM images, DICOM registrations, DICOM RTSS). Files of formats other than DICOM will be ignored and cannot be crawled with this type of crawler.

The DatasetDicomCrawler is used for searching appropriate files within a specific dataset directory containing at least one subject folder with DICOM files. If just one subject in a single directory is to be crawled, we recommend using the SubjectDicomCrawler. If you want to load a large dataset with many subjects, we recommend using the iterative crawling approach instead of crawling the data via execute() to reduce memory consumption (see example below).

The prioritized method to extract the Modality for the retrieved images is the usage of a modality configuration file. If no modality configuration file is available for a specific subject directory, the DatasetDicomCrawler will try to extract the Modality from the retrieved subject images using the ModalityExtractor. If no ModalityExtractor is provided, an exception will be raised.

The DatasetDicomCrawler can be used to generate the modality configuration file skeletons for all subjects in the dataset directory. In this case, set the write_modality_config parameter to True and execute the crawling process. The generated modality configuration file skeletons will be saved in the appropriate subject directories.

Important

This crawler exclusively supports the DICOM file format and does not support any other file format.

Example

Demonstration of the iterative and the non-iterative loading approach:

>>> from pyradise.fileio import (DatasetDicomCrawler, SubjectLoader)
>>>
>>>
>>> def main_iterative_crawling(dataset_path: str) -> None:
>>>     # Create the crawler (using the modality configuration file)
>>>     crawler = DatasetDicomCrawler(dataset_path)
>>>
>>>     # Use the crawler iteratively (more memory efficient)
>>>     for series_info in crawler:
>>>         subject = SubjectLoader().load(series_info)
>>>         # Do something with the subject
>>>         print(subject.get_name())
>>>
>>>
>>> def main_crawling_using_execute_fn(dataset_path: str) -> None:
>>>     # Create the crawler (using the modality configuration file)
>>>     crawler = DatasetDicomCrawler(dataset_path)
>>>
>>>     # Use the crawler with the execute function
>>>     # (all series info entries are loaded in one step)
>>>     series_infos = crawler.execute()
>>>
>>>     # Iterate over the series infos
>>>     for series_info in series_infos:
>>>         subject = SubjectLoader().load(series_info)
>>>         # Do something with the subject
>>>         print(subject.get_name())

Parameters:
  • path (str) – The dataset directory path to crawl.

  • modality_extractor (Optional[ModalityExtractor]) – The modality extractor (default: None).

  • modality_config_file_name (str) – The file name for the modality configuration file within the subject directory (default: modality_config.json).

  • write_modality_config (bool) – If True, writes the retrieved modality configuration to the subject directory (default: False).

execute()

Execute the crawling process to retrieve the DicomSeriesInfo entries.

Returns:

The retrieved DicomSeriesInfo entries.

Return type:

Tuple[Tuple[DicomSeriesInfo, …], …]