Labs' Darin McBeath has made AnnotationQuery for Spark available as open source on GitHub. AnnotationQuery provides a suite of composable functions for querying annotations stored as Parquet files. The annotations will typically be generated by popular text analytics tools such as Stanford CoreNLP or GENIA, but the only requirement is that they adhere to the AQAnnotation structure. The underlying implementation leverages Spark Datasets and Spark SQL.
Implemented functions include ContainedIn, Contains, Sequence, Between, Preceding, Following, Before, and After. More information is available at https://github.com/elsevierlabs-os/AnnotationQuery
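To give a feel for what "composable" span queries look like, here is a minimal sketch in plain Python over in-memory lists. It is not the AnnotationQuery API and does not use Spark: the `Annotation` class, its field names, and the `contained_in` and `before` helpers are hypothetical, and the boundary rules used here (inclusive containment, "ends at or before the start") are assumptions that may differ from the library's actual behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    # Hypothetical, simplified record; not the real AQAnnotation schema.
    doc_id: str
    annot_type: str
    start: int
    end: int


def contained_in(left, right):
    """Keep annotations from `left` that lie entirely inside some
    annotation from `right` in the same document (assumed semantics)."""
    return [
        a for a in left
        if any(a.doc_id == b.doc_id and b.start <= a.start and a.end <= b.end
               for b in right)
    ]


def before(left, right):
    """Keep annotations from `left` that end at or before the start of some
    annotation from `right` in the same document (assumed semantics)."""
    return [
        a for a in left
        if any(a.doc_id == b.doc_id and a.end <= b.start for b in right)
    ]


# Made-up toy data for illustration only.
sentences = [
    Annotation("d1", "sentence", 0, 40),
    Annotation("d1", "sentence", 41, 90),
]
genes = [
    Annotation("d1", "gene", 5, 9),
    Annotation("d1", "gene", 50, 56),
    Annotation("d2", "gene", 5, 9),
]

# Genes inside the first sentence of d1.
genes_in_first = contained_in(genes, sentences[:1])

# Because each function takes and returns a list of annotations, calls can
# be nested: genes inside any sentence that also precede the second sentence.
genes_before_second = before(contained_in(genes, sentences), sentences[1:])
```

Each helper takes annotation lists and returns an annotation list, which is what lets the output of one query feed the next. The quadratic scan here is only for readability on a handful of records.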
We have been using this library within Elsevier Labs to analyze the billions of annotations we have developed from our content. We hope you find it useful.