Releases: UW-COSMOS/Cosmos

v0.7.1

01 May 20:53
637d6df

Includes build updates, CPU fixes, and a more reasonable Docker base image.

v0.7.0

30 Apr 18:13
31a48f2
  • Standalone COSMOS service
  • New method of equation detection

Change base image

31 Mar 16:02

The previous base image was deprecated, so the base image is now nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04.

v0.6.1 - Minor table extraction fix

07 Mar 17:59
  • Fixed a bug where empty parquet files stopped all table extraction processing.

Table extraction, HTCosmos

27 Feb 21:41

New:

  • Inclusion of table extraction (via --extract-tables option on ingest_documents script)
  • HTCosmos - run COSMOS pipeline in a high-throughput mode on an HTCondor cluster

Table context enrichment, text normalization, and fixes

10 Aug 19:14
  • Table context enrichment during ingestion. When enabled (via the --use-table-context-enrichment option on the ingest CLI), detected tables are matched to mentions within the body text, and a context_from_text field is added to the output parquet.

  • The retrieval API has been updated to search either:

    • local_content field (default) - the text content of the table and its associated caption, if any
    • full_content field - local_content plus context_from_text
    • Any of the three fields separately (content, caption_content, context_from_text)
  • Text normalization. When enabled (via the --use-text-normalization option on the ingest CLI), basic Unicode normalization is applied to the text layer to regularize ligatures and repair mojibake.

  • ASKE-ID lookup within the retrieval API.
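
To illustrate the kind of cleanup the text normalization option describes, here is a minimal sketch using only the Python standard library. It is not the COSMOS implementation: the function names repair_mojibake and normalize_text are hypothetical, NFKC is an assumed choice of normalization form, and the mojibake repair only handles the common case of UTF-8 text that was mis-decoded as Latin-1.

```python
import unicodedata


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text mis-decoded as Latin-1 (hypothetical helper, assumption)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake; leave the text unchanged.
        return text


def normalize_text(text: str) -> str:
    """Repair simple mojibake, then apply NFKC, which expands ligatures."""
    return unicodedata.normalize("NFKC", repair_mojibake(text))
```

NFKC replaces compatibility characters such as the single-codepoint "ﬁ" ligature with their plain equivalents. Text that the repair step cannot round-trip (for example, mojibake produced through Windows-1252) is passed through to NFKC unchanged.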

v0.4.0 - New weights; retrieval API updates

16 Feb 21:32
23a7cc5
  • New weights including a newer set of annotations
  • Added a few necessary files for training detection + postprocessing.
  • API key requirement added (though currently disabled)
  • Document level lookups and filters
  • Filter by dataset_id
  • Store and filter on object size
  • Concatenate the contents and header_content fields into one full_contents field and use that for retrieval
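
The field concatenation in the last item can be pictured with a small sketch. This is an illustration under assumptions, not the project's code: build_full_contents is a hypothetical helper, records are modeled as plain dicts, and the join order and single-space separator are guesses.

```python
def build_full_contents(record: dict) -> dict:
    """Return a copy of the record with a combined full_contents field."""
    # Assumed order: contents first, then header_content; empty parts are skipped.
    parts = [record.get("contents"), record.get("header_content")]
    full = " ".join(p.strip() for p in parts if p and p.strip())
    return {**record, "full_contents": full}
```

Retrieval would then index or search the single full_contents field instead of the two source fields.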

v0.3.0 - New pipeline, entity linking and semantic context for tables

04 Dec 18:57
4aa562e
  • Modular pipeline with new workflow definitions, cli, unicode (#122)
  • Initial entity linking using SciSpacy (#135)
    • Entity recognition + linking to UMLS entities
  • Initial semantic context for tables (#137)
  • (ongoing) documentation to match

Optimization and connected components update

06 May 19:45
b40ef3c
  • Remove equation2latex
  • Move merging to run.py
  • Remove extra call to tesseract from list2html
  • Add handling of margin objects in connected components

Attentive RCNN and updates to ingestion

15 Apr 18:41

Updated the document segmentation model to an Attentive RCNN model.