Highlights from code4lib 2017

by Corey A Harper

Code4lib 2017 was hosted by UCLA on March 6-9, 2017. This was the 12th code4lib conference, and it was attended by over 450 library technologists. Amazingly, despite the increased size, the main conference has managed to remain a single-track meeting. This contributes to its appeal, as the shared experience fosters a sense of camaraderie and a strong community ethos. It also exposes attendees to topics and areas that might be slightly less central to their specific interests.

For the pre-conference day, I registered for a half-day morning "Introduction to Text Mining" session, and a half-day afternoon "Ally Skills Workshop". The Text Mining workshop was introductory, but it moved at a good speed for most of the attendees in the room. Some of the workshop was done in spaCy, so it was a good second pass through that codebase for me. We got deeper into the entity recognition and word2vec features than I had previously, and I spent some time looking more closely at the code itself.
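The word2vec features we explored rest on a simple idea: words become points in a vector space, and similarity is the cosine of the angle between them. A library-free toy sketch of that similarity measure (the three-dimensional vectors below are invented for illustration; real embeddings are learned from corpora and have hundreds of dimensions):

```python
from math import sqrt

# Toy 3-d "word vectors" -- invented for illustration only; real
# word2vec embeddings are learned, not hand-written like these.
vectors = {
    "library": [0.9, 0.1, 0.3],
    "archive": [0.8, 0.2, 0.4],
    "banana":  [0.1, 0.9, 0.0],
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Related words point in similar directions, so their cosine is near 1.
print(cosine(vectors["library"], vectors["archive"]))  # high, ~0.98
print(cosine(vectors["library"], vectors["banana"]))   # low, ~0.21
```

Libraries like spaCy wrap exactly this kind of comparison behind a `similarity()` convenience method, with the vectors supplied by a pretrained model.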

The second pre-conference was among the highlights of the conference for me. I attended the Ally Skills Workshop in the afternoon, and found it tremendously valuable. This workshop is based on the Ada Initiative's Ally Skills Workshop, which—according to their website—was updated in 2016 to cover "race, sexuality, disability, age, class, and religion, as well as gender." Though taking the workshop made me feel the loss of the Ada Initiative even more acutely, it's great that the workshop materials live on and this important training continues to be offered in a variety of contexts. My key takeaways from the workshop included:

  • Allyship is an action. It is something you do. Attitudes aren't enough.
  • Allies need to take these actions because studies show that people from minority groups are penalized for engaging in diversity-valuing behavior.
  • Don't expect praise or credit for fighting inequality.
  • Follow & support leaders from target groups.
  • Follow your discomfort: find out more and understand why before reacting.

The contrast of these two pre-conferences foreshadowed the range of sessions in the conference itself. Code4lib, once a purely technical conference, has evolved considerably over the years and now includes as many sessions on the social side of code as on the technical side. Some of the best material from the conference this year was focused on product and project management, leadership, inclusivity, and community building. This held true of the two keynotes as well.

This year's keynotes were hands down the best part of the conference for me. Unlike in years past, both of the keynotes came from within the code4lib community. Normally I would prefer at least one keynote coming from an external perspective, but this year internal keynotes felt useful, perhaps because the community, our institutions, and the profession are at an inflection point.

The opening keynote, from Andreas Orphanides, started the conference off along the theme of humanism in technology. Titled "It's made of people: designing systems for humans", Dre's talk discussed the nature of systems and the nature of models of those systems, and included my favorite George Box quote: "All models are wrong, but some are useful." He noted that goals are also models, and that we risk inadvertently focusing on the wrong things if we target a metric without understanding its underlying context. Dre also discussed data as model, and briefly addressed the assumptions and biases inherent in both data and algorithms. It was a very thought-provoking talk covering a lot of ground.

The closing keynote, from Christina Harlow, also combined the sociopolitical with the technical in a compelling and engaging way. Christina's talk was accompanied by an interesting technique for audience engagement: she provided a comment-enabled copy of her transcript, and encouraged people to add comments, notes, questions, and discussion both during and after the presentation. It's a great way to generate a conversation around a topic, and her topic was remarkably rich. I can't do it justice with a summary, but the gist was that we need to radically rethink our approach to data operations and data engineering in the library community. She drew on Riot Grrrl for inspiration on how this would work. Her conclusion was that the library technology community should radically rethink its approach to collaboration and infrastructure, because much of the current way of working is hindering progress. Along the way, she covered community, ethics & politics, social justice, information security, preservation, documentation, tools, infrastructure, openness, and transparency. Her presentation was the first time I've seen a standing ovation at a code4lib. It was a great talk, and a great way to wrap up an engaging and exhausting conference.

These two great keynotes bookended a phenomenal meeting, and there were too many great presentations to include here. Some of my highlights were:

A useful practical outcome for me was the emergence of a small Spark interest group within the code4lib space. Currently, this consists of myself and about 10 others from DPLA, Stanford, and Temple. It primarily exists as a spark channel in the code4lib Slack and periodic (currently bi-weekly) Google Hangout sessions to share code and ideas.

The links to the sessions above should eventually have video of individual presentations. In the meantime, you can find recordings on the Code4lib YouTube page.

Highlights from Int'l Workshop On Mining Scientific Publications (WOSP2016)

by Mike Lauruhn

Co-located with the Joint Conference on Digital Libraries (JCDL), the 5th International Workshop On Mining Scientific Publications took place on June 22-23 at the Newark, New Jersey campus of Rutgers University. An engaged crowd of about 25 listened to paper presentations and participated in conversation and networking. Twelve papers and demos were presented in addition to keynotes and invited talks. A few themes surfaced over the course of the papers. One was how to surface tangible, measurable credit for contributions to research that fall outside the explicit citations.

In their paper, "Measuring Scientific Impact Beyond Citation Counts" Robert Patton and his colleagues from the Oak Ridge National Laboratory described the notion of context-aware citation analysis. Their paper cites two gaps that they see in citation counts and measuring impact:  "1. Not all cited works provide an equal contribution to a publication; 2. Not all required resources are provided appropriate credit or even credit at all."

The paper explains that "some cited works are provided merely for reference or background purposes for the reader while other cited works are so critical to the citing work that the citing work would probably not have even existed if not for the existence of the cited work."

On the second point, regarding required resources being given appropriate credit: the paper points to a need for researchers who use shared resources (such as computing facilities unique to an institution, like a National Lab) to expressly credit that institution as a contributor to the research output. This is essential on a few fronts. First, it contributes to overall reproducibility initiatives. Second, it helps those who maintain such shared resources and services quantify their own scientific contributions and outputs, which in turn helps with budget decisions. Related to that theme, I presented a paper on behalf of Elsevier Labs and The Arabidopsis Information Resource (TAIR) on the additional value that an author receives in terms of citation hits when they use a shared resource such as a Model Organism Database.

They conclude that future work should focus on a context-aware citation analysis that presents the different values that citations contribute to a citing work, and that these different values should factor into measuring impact. Second, impact assessment drawing on more detail from full content should be able to reveal the manner in which a research area "begins, grows, and fades," and the stages of that lifecycle could also be used to assess impact.

Also related to the lifecycle of a particular area of research, Shubhanshu Mishra presented a long paper, "Quantifying conceptual novelty in the biomedical literature." In it, the team used MeSH terms to measure and quantify the novelty of MEDLINE articles. Their research found interesting trends in the biomedical domain, where concepts tend to pass through four phases: Burn-In, Accelerating Growth, Decelerating Growth, and Constant Growth.

In addition, they found that the novelty of an article can also be measured through novel combinations of existing concepts. They also attempted to measure authors' novelty across their careers -- finding that novelty goes down over time. Proceedings from WOSP2016 will be in the October/November issue of D-Lib.

ESWC 2016: Keynotes and paper highlights

by Mike Lauruhn

The main program of the 2016 ESWC conference took place May 31 - June 2, 2016. [For my write-up on the Workshops & Tutorials, please see my previous blog post.] For some reason, it seemed fitting to me personally that the main program of ESWC 2016 was bookended by a pair of talks about owl:sameAs.

Jim Hendler gave the opening keynote with the intriguing title "Wither OWL in a knowledge-graphed, Linked-Data World?" (though in the days leading into the conference, he asked attendees to consider the implications of both "wither OWL" and "whither OWL"). In his talk, he reminded the audience that the need for ontologies in the real world is increasing and that "On the Web, ontologies are increasingly needed." Shortcomings lie in the manner in which owl:sameAs is used incorrectly when mapping data to data, and in the fact that owl:sameAs does not account for a part-to-whole relationship, which is needed across the gene ontology as well as most medical and health-science models. He also cited the lack of temporal reasoning in OWL. Ultimately, he argued that more research needs to go into formalizing these types of expressions, which will be essential for the modern web.
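Hendler's warning about misused owl:sameAs is easy to demonstrate: the property is symmetric and transitive, so a reasoner collapses every chain of sameAs links into a single identity. One part-to-whole relationship mislabeled as identity therefore merges things that are not the same. A small illustrative sketch in plain Python (the URIs and the merging logic here are my own invention, not from the talk):

```python
# owl:sameAs is symmetric and transitive, so reasoners merge all
# connected identifiers into one identity set. The second triple below
# misuses sameAs for a part-of relation -- exactly the error Hendler
# described. All URIs are invented for illustration.
sameas = [
    ("ex:BRCA1_gene", "dbpedia:BRCA1"),     # plausibly a real identity link
    ("dbpedia:BRCA1", "ex:Chromosome_17"),  # wrong: part-of, not identity
]

def closure(pairs):
    """Collapse sameAs pairs into identity sets by repeated merging."""
    groups = []
    for a, b in pairs:
        hits = [g for g in groups if a in g or b in g]
        merged = {a, b}.union(*hits)  # union of the new pair and any overlapping sets
        groups = [g for g in groups if g not in hits] + [merged]
    return groups

for group in closure(sameas):
    print(sorted(group))
# One bad link, and the gene is now "identical" to the whole chromosome.
```

This is why Beek's later paper (below) on weakening sameAs into context-dependent subrelations resonated so strongly with the keynote.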

The keynote almost foreshadowed a great paper presentation from Wouter Beek. In it, Beek addressed issues related to entities considered to be the same in some, but not all, contexts. While acknowledging that one cannot anticipate every possible context in which an entity might be used, Beek and his coauthors propose an alternative semantics for owl:sameAs that creates a hierarchy of subrelations. Within the hierarchy, an identity statement depends on the dataset in which the statement occurs.

Paper highlights included the eventual Best Paper winner for the In Use & Industry track. Achille Fokoue from IBM presented Tiresias, a framework for predicting Drug-Drug Interactions. The presentation covered the data integration and construction of a knowledge graph, as well as the similarity metrics and prediction that took advantage of the graph. 

Three more papers in the same track caught my attention as they discussed initiatives attempting to bridge gaps that would both encourage generation of Linked Data -- Pieter Heyvaert, 'RMLEditor: A Graph-based Mapping Editor for Linked Data Mappings' and Mauro Dragoni, 'Enriching a Small Artwork Collection through Semantic Linking' -- and make Linked Data applications easier to develop -- Ali Khalili's 'Adaptive Linked Data-driven Web Components: Building Flexible, Reusable Semantic Web Interfaces'.

ESWC 2016 proved to be a busy week of tutorials, workshops, keynotes, and papers. It is quite interesting to see how mature a domain is when conferences are able to strike a balance between emerging technologies (tracks dedicated to Reasoning and Machine Learning), a view of opportunities in industry (via the In Use track and keynotes), and reflection on past foundations.

ESWC 2016 report: sampling the preconference tutorials & workshops

by Mike Lauruhn

ESWC 2016 took place from May 29th to June 2nd, 2016 in Crete, Greece. The program had lots to offer in a variety of formats, including workshops, tutorials, papers across several tracks and specializations, posters and demos, and keynote speakers.

The first two days of the conference offered more than 20 workshops and tutorials. After deliberation, I took advantage of parts of two workshops and two tutorials. First up was "From linguistic predicate-arguments to Linked Data and ontologies: Extracting n-ary relations," a hands-on tutorial from the NLP team at Pompeu Fabra University in Barcelona. The ambitious tutorial efficiently covered a lot of ground. In a nutshell, the tutorial was designed to introduce NLP tools and resources for identifying predicate-arguments in text, and to provide an overview of models and methods for mapping those arguments into the Semantic Web.

This included a variety of sections, covering methods and tools for deep linguistic text analysis; representing arguments in RDF/OWL for moving natural language to the Semantic Web; and example applications for relation extraction and evaluation. It gave a solid introductory overview of some of the available resources, including PropBank, VerbNet, and FrameNet. The second half of the tutorial focused on tools and resources. The presenters also took the opportunity to showcase their own relation extraction demo, which allows users to enter a sentence and view five different annotations via BRAT visualization.

The afternoon of the first day, I was able to attend the tail end of the 1st Workshop on Humanities in the Semantic Web (WHiSe). This included a pair of excellent papers and an intriguing round table on where to go (literally and figuratively) moving forward. While highlighting some specific projects (sorry I missed the paper with the best title of the conference -- "Linked Death," on the lifecycle and applications of Linked Open Data World War II death records), the workshop gave some time to discussing the big-picture applications of semantic web technologies in processes associated with the humanities, and why there is less conversation about research ecosystems (and specifically technologies and protocols) in the sphere of the humanities.

The papers I attended were "On the description of process in Digital Scholarship", presented by David De Roure, and "An ecosystem for Linked Humanities Data" from Rinke Hoekstra. De Roure pointed out that the provenance of historical artifacts and their digitization can be represented in PROV. Further, he went on to say that when W3C PROV convened, it did so with use cases that were physical and digital, but that since then, little has been done with physical objects. Rinke expanded on the theme with an elaboration on what the research ecosystem for Linked Humanities data should entail. The paper notes that much of the usable data that is available is a result of a top-down approach -- prominent datasets from large collections. The paper then presents a model for individuals to publish their smaller datasets, and link them to existing vocabularies and other datasets. I would be remiss if I didn't mention Rinke's awesome use of herring and the fishing industry as an illustration of the distribution of data.

The lively round table concluded the workshop with two prominent themes: expanding on what the humanities research ecosystem should be, and how to continue to bridge the semantic web and humanities communities. In my opinion, the most intriguing question from the first part was "what is the equivalent of a bioinformatician in the humanities space?" On a more practical matter, the question arose of what would be the appropriate venue for another WHiSe workshop.

On the morning of day two, I attended the Second International Workshop on Semantic Web for Scientific Heritage. Christophe Debruyne was the invited speaker and presented on the objectives and challenges encountered in the Linked Logainm project -- "Publishing and Using an Authoritative Linked Data Dataset of Irish Place Names." The service was designed to help librarians who want authoritative vocabularies that can be integrated with existing bibliographic systems. Linked Logainm was a success in that its concepts and structure are fine-grained and account for nuances of Irish history, geography, and county types. The issue that stuck out to me the most was a provenance topic around the intermingling of authoritative data with data (concepts, facts, relations) that is semi-automated and perhaps 'less verified' or held with less confidence. Debruyne elaborated on the discussions about how best to capture and note these, or even keep them in a separate graph.

Two additional workshop presenters had tons of interesting overlap: Seth van Hooland discussed a topic modelling and linguistic annotation framework for modeling the Hebrew Bible, while Anja Weingart presented "Lexicon for Old Occitan medico-botanical terminology in Lemon model." ("Lemon is a proposed model for modeling lexicon and machine-readable dictionaries and linked to the Semantic Web and the Linked Data cloud." http://lemon-model.net/)


The final stop of the preconference weekend was the LOD Lab tutorial put on by the LOD Laundromat team. The Laundromat is a service that "provides access to all Linked Open Data (LOD) in the world." The Laundromat metaphor refers to the manner in which the data is cleaned: syntax errors, duplicates, and blank nodes are removed, and the data is represented as N-Triples. For me, the highlight of the tutorial was the hands-on dive into LOTUS, a service that allows searching LOD Laundromat statements using natural text.

ALA Annual in San Francisco (#alaac15)

by Mike Lauruhn

Following a fun day at the Jane-athon and a team dinner with friends and colleagues, I was ready for the weekend sessions at ALA Annual. Managing an ALA weekend schedule has always meant making decisions about what to attend and acknowledging that one simply cannot attend every session they would like to. For me, this year's case in point was the six Linked Data sessions crammed into two back-to-back time slots on Saturday morning. Combine this with planned and spontaneous reunions and the allure of an enormous exhibits hall, and the weekend is always busy and memorable. Nonetheless, I made my choices and got to hear some intriguing talks from a variety of organizations. With that as a background, here are a pair of highlights from sessions I was able to attend:

The Cataloging Norms Interest Group featured Diane Hillmann's talk about legacy bibliographic metadata issues as the use of Library Linked Data, BIBFRAME, and RDA increases. The talk was quite practical and emphasized the need to hang on to legacy metadata (i.e., MARC records). She stated her belief that new business models for metadata management will likely emerge, with local caches bypassing central cache management. She also noted that there are too many open questions about what we (and those who come after us) will want from our legacy metadata. Instead of dwelling on what to keep, share, and map, just keep it ("Storage is cheap").

The ACRL Science & Technology Section hosted a panel about Federal Public Access Plans. The panel was designed to provide some examples of how Federal Agencies are developing plans to support access to publicly funded research results in response to the 2013 memo from the Office of Science and Technology Policy. This was one of my favorite sessions of the weekend, and it featured two of my favorite talks (neither of which used PowerPoint. Coincidence?). The speakers included Pamela Tripp-Melby from the National Library of Education and Amanda Wilson from the National Transportation Library. For me, it was important and refreshing to hear about data planning and management needs for researchers and agencies outside the scope of health sciences, laboratory, and field sciences.

Amanda Wilson, Director of the National Transportation Library, discussed the Public Access Plan for the Department of Transportation. She stated that they would not require depositing of data with the DoT, but researchers would need to self-certify a deposit with a repository that meets their standards, mostly focused on access and sustainability. DoT is also requiring ORCID IDs for researchers and DOIs for corresponding datasets. 

Similarly, the Director of the National Library of Education, Pamela Tripp-Melby, gave an update on access planning for the Department of Education. For more than 50 years, the Department of Education has had the ERIC database as a source for literature. Perhaps the biggest revelation of the talk was how much exposure researchers' data management plans would receive -- with plans to include links to the DMPs in ERIC.

My favorite audience comment of the session [paraphrased] was: "Let's not delay in putting teeth behind these mandates. Don't allow bad habits to continue."

Regardless of venue, ALA Annual is always a fun, inspiring, overwhelming, and enlightening experience. Hundreds of sessions, thousands of vendors -- there really is something for everyone (pardon the cliche). I always encourage anyone to attend. In the meantime, check out the conference scheduler page where many presentations have been uploaded.

Why the Jane-athon -- and the Jane-athon format -- matters

by Mike Lauruhn 

"As a matter of fact, I am registered for the SF #janeathon", the tweet proclaimed. Yes, it was an all-day Jane Austen-related workshop taking place at the American Library Association Annual conference. And no, it was not an endurance competition reading Jane Austen novels or watching their film adaptations. In the official event description, it was stated that "participants will explore the very real issues inherent in using Resource Description Framework (RDF) statements in aggregated “packages” ready to contribute to the Linked Data world."

Over the past five years, I had attended various Linked Data and RDF events at both ALA Annual and ALA Midwinter -- and even participated on panels at such events. The format often followed a familiar template: a person representing an organization stands up, shows their organization's vocabularies, shows concepts from the vocabularies marked up as RDF, yada yada yada, and then shows some amazing mash-ups, visualizations, and UIs they are thinking about offering. I fear that many people left these events with a similar sentiment: "I understand Linked Data, I get RDF, we've used vocabularies and authority control forever, I want to be able to offer next-gen applications to my users, but what about the middle part? Where is the actual legwork? What will our daily tasks look like? What will the tools be like?"

The Jane-athon adopted a hackathon format to facilitate a hands-on approach, putting attendees in the driver's seat with software specifically designed for describing Jane Austen resources using RDA.


In the month prior to the event, the attendees were contacted by organizers with instructions aimed at arriving in San Francisco ready to hit the ground running. This included downloading the RIMMF software (RDA in Many Metadata Formats) and familiarizing ourselves with it through tutorials. I was personally impressed with RIMMF's capabilities for accessing and incorporating trusted vocabulary and authority records from LC and other providers, in addition to providing access to Wikipedia and other resources. We also took a brief survey to help assign the tables and coaches we would be working with. Deborah Fritz hosted a pair of webinars to answer questions that had arisen. Finally, we were given instructions on collecting Jane Austen-related works to bring.

At the opening of the Jane-athon, Gordon Dunsire and Deborah Fritz welcomed attendees and introduced fellow organizers, coaches, and team leaders. We sat at assigned tables based upon interests in work types: print books; audio & ebooks; sequels, prequels, & spin-offs; criticism; film & television; etc. The stated goal was to "have fun" and "make as many Jane records as possible for exporting to RDF for experimentation." After the intro and overview, nearly two solid hours were allotted to RIMMFing. It was fun to see all the interaction at the various tables: coaches and team leaders making the rounds; people looking over shoulders; participants helping each other out. Individuals imported data and established relationships between entities and descriptions in metadata records. Tables tackled and shared a variety of questions about representing relationships, translations, spin-offs, and compilations. In the end, just over 400 Jane Austen records were created to be curated and added to the rballs site. (R-balls "contain linked data and Semantic Web representations of cultural heritage resources".)

In the afternoon, there was a discussion about experiences from the morning and a chance to address questions that arose. Then, Diane Hillmann and Jon Phipps led a discussion on the implications of RDA for Linked Data in libraries. What the conversation, and the day as a whole, confirmed to me was that the tools and expertise are maturing and that Linked Open Data adoption is less about technology and more about governance and trust.

Putting on a workshop is hard. Hands-on ones are even harder. And getting a large diverse group of attendees ready and able to hit the ground running is an enormous feat. The core organizers** of the Jane-athon should be commended for the time and effort in putting together another Jane-athon. Here's hoping the model can be replicated.

(** Gordon Dunsire, JSC; Diane Hillmann and Jon Phipps, Metadata Management Associates; Deborah Fritz and Richard Fritz, TMQ; and James Hennelly, ALA and the RDA Toolkit.)

Report from NASKO 2015 at UCLA

by Mike Lauruhn

UCLA's Royce Hall was the setting of the biennial North American Symposium on Knowledge Organization (#NASKO2015) on June 18-19, with the Department of Information Studies serving as host. Over the two days of the symposium, the variety and range of papers was impressive and represented the many different ways that the field of Knowledge Organization is approached. The papers included project overviews and status updates, papers heavy in metatheory and philosophy, and papers about history and contributions to the field.

Some highlights: 
Hilary Thorsen of Stanford University Libraries presented Ontologies in the Time of Linked Data. In it, she gave an overview of the popular Linked Jazz project as an example of developing a domain-specific ontology for Linked Open Data (LOD) applications. She talked through the steps of approaching the music domain and engineering an ontology for Linked Data that expresses the important elements and relations for the domain of jazz musicians and their networks. The presentation was very insightful, frank, and practical.

Rebecca Green from OCLC's Dewey Services team spoke about how the complex relations between indigenous peoples in the United States and the United States government are (or aren't) manifest in DDC. She argues that many of the instances people point out are a result of misunderstandings, but that others are areas where the DDC needs to consider improvement. Some of the complexities are a result (or reflection) of treaties, the designation of recognized tribes, and the sovereignty of reservations -- which inconsistently conflict or coincide with how geographic locations are handled elsewhere in the schema. She also cited previous criticism of the treatment of indigenous peoples and noted where improvements have been made. It was a thought-provoking talk that highlights what can happen when schemas reinforce stereotypes. Later in the symposium -- through his papers and conversation -- UCLA's Greg Leazer highlighted the need to address culture, language, and ethics in KO work.

Murtha Baca and Melissa Gill from the Getty Center discussed their paper Encoding Multilingual Knowledge Systems in the Digital Age: the Getty Vocabularies. It focused on the Getty vocabulary properties (the Art & Architecture Thesaurus (AAT®), the Getty Thesaurus of Geographic Names (TGN®), and the Union List of Artist Names (ULAN®)) as schemas that are evolving in hopes of reaching their potential to play key roles in the Linked Open Data environment. The talk specifically addressed the processes the Getty undertakes to develop the cross-cultural and multilingual aspects that make their schemas valuable. Among the challenges beyond translation is addressing homographs that have entirely different meanings across cultures (retablo, for example, refers to different objects in Spain vs. Latin America). A nice presentation that would be insightful to members of the LODLAM community.

Every talk was greeted with insightful questions and constructive criticism. The symposium remains an important format for bringing together diverse approaches and aspects to a field that is consistently in transition, but still relies heavily on its legacy. Looking forward to future activities from this community.

Recap of June 16 Cincinnati Spark Meetup

by Curt Kohler & Darin McBeath

At this month's Cincinnati Spark Meetup, Doug Needham from Illumination Works presented some background on graph theory, and then we dug into some code examples using the Spark GraphX library. There were a lot of great examples, discussion, and interest in graphs. After the presentation, we reviewed some of the highlights from the recent Spark 1.4 release (SparkR), reflected on some of the announcements from Spark Summit West, and held a question & answer session for the edX course Introduction to Big Data with Apache Spark, which many members are taking.
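The flavor of those GraphX examples is easy to convey without a cluster: a graph is just an edge list, and many graph queries reduce to aggregations over it. A plain-Python sketch of a degree computation (the vertex IDs and edges below are invented; GraphX performs the same kind of aggregation, but over distributed RDDs of vertices and edges):

```python
from collections import Counter

# A tiny directed graph as an edge list of (source, destination) pairs.
# Vertex IDs are made up for illustration.
edges = [(1, 2), (1, 3), (2, 3), (3, 1), (4, 3)]

# In-degree: how many edges point at each vertex;
# out-degree: how many edges leave it. GraphX exposes these as
# inDegrees/outDegrees over its distributed edge RDD.
in_degrees = Counter(dst for _src, dst in edges)
out_degrees = Counter(src for src, _dst in edges)

print(in_degrees.most_common(1))  # vertex 3 has the highest in-degree
```

The point of GraphX, of course, is that the edge list can be billions of pairs spread across a cluster while the query stays this simple.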

The Meetup continues to increase in popularity. After only 5 months, we have grown to include over 90 members who want to learn more about the open-source cluster computing platform. The June 16th Meetup had about 25 attendees, and we already have presenters lined up to discuss SparkR at the next Meetup -- targeted for late August. It is exciting to be part of such a vibrant community where the members contribute and want to learn about Spark.

Cincinnati Spark Meetup Wrap-up

Cincinnati Spark Meetup, April 15, 2015.

by Curt Kohler & Darin McBeath

The Cincinnati Spark Meetup continues to attract new members at a brisk pace. After only 3 months we have grown to include over 70 members who are interested in learning about this compelling new technology platform. On April 15th, 2015, about 30 members gathered at a local business for our latest Meetup.  During the first half of the evening, we watched a video of Matei Zaharia’s Spark Summit East keynote address and were joined via phone by Michael Armbrust (lead developer for Spark SQL at Databricks) for a Q&A session on Spark 1.3.0 and the new DataFrames APIs. The second half of the evening provided the opportunity to participate in a beginner’s “hands-on” programming session with Spark with experienced members available for help.  The session provided sample data, a set of questions about the data to answer using typical Spark processing patterns, and solutions to the exercises in Scala, Python, and Java. 

After reflecting on the meeting, there were a few items that really stood out:

  • Spark is definitely a trending, hot tech topic. One group at the Meetup had made a two-hour drive from Lexington, Kentucky to learn more about the platform. Even in the Midwest, there are companies releasing production processes leveraging Spark.

  • DataFrames will be the API of the future for Spark. The named-column programming paradigm makes code much less cryptic than the RDD positional offsets you typically encounter with lower-level Spark APIs. The DataFrame approach also reduces the amount of code you need to write, as the optimizer handles much of the data remapping needed for multiple joins and similar operations.
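The readability gain is visible even without a Spark cluster. A library-free sketch contrasting positional access (RDD style) with named columns (DataFrame style), using made-up rows:

```python
# Made-up example rows; in Spark these would be an RDD of tuples
# versus a DataFrame with a schema of named columns.
rows = [("alice", 34, "cincinnati"), ("bob", 29, "lexington")]

# RDD style: positional offsets -- quick, but what is row[1] again?
adults_rdd_style = [row[0] for row in rows if row[1] >= 30]

# DataFrame style: named columns make the same query self-documenting.
columns = ("name", "age", "city")
records = [dict(zip(columns, row)) for row in rows]
adults_df_style = [r["name"] for r in records if r["age"] >= 30]

print(adults_rdd_style, adults_df_style)  # both: ['alice']
```

In actual Spark code the DataFrame version would read much the same (`df.filter(df.age >= 30).select("name")`), while also letting the Catalyst optimizer plan the joins and remapping for you.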

Finally, for those who are interested, the material from the programming session is available in an S3 bucket and can be found here: http://cincy-spark.s3.amazonaws.com/