Motivation
For a reason that I'm not disclosing, I had the opportunity to download some 登記所備付地図データ. While I ended up not using my code, I thought it would be a great way to:
- demonstrate how I write my code
- help people in a similar situation
- explain why I cannot bring myself to like the example converter, provided by the DA (Digital Agency), that turns the XML this data comes in into GeoJSON.
Why did you write this in English?
More on that later.
Summary of what my code does
- Get a list of files you might want to download using their API
- Some filtering on the client side as I couldn't (or at least didn't bother to) refine the query.
- you can read the official documentation here, but this is kind of unfriendly (e.g., suppose you want to find details on api/3/action/resource_create, there is a line break between resource and _create, so you can't simply jump to the section you need) so I went for this to grasp the gist.
- As of 2024, the zip files that you can obtain from the website are constructed as follows:
top.zip
├───0123456789.zip
│ └───0123456789.xml
…
- First download the zip file and retain it in memory, and then unzip the file itself and each item within it until you hit an XML file.
Choosing the Language
Identifying the Bottleneck and Key Features for Your Task
When selecting a language for your task, one must consider the following criteria:
- Enhanced Support for Parallelism
- Modern CPUs are equipped with multiple cores, thus it is advantageous to unzip multiple files simultaneously.
- Efficient memory usage is paramount. Where possible, utilise streaming to process parts of an object sequentially, rather than loading the entire object into memory.
- Network bandwidth may well prove to be your primary bottleneck.
- Download multiple files concurrently to maximise the utilisation of available bandwidth.
- Portability and Operational Ease
- Who is running the program?
- Chances are you are not the only person running the program. It could be the operations team, or it could even be another developer using it as a dependency. Unless you want issues filed against you or unsolicited midnight calls from the operations team, it is prudent to minimise the risk of runtime errors.
- A standalone, single binary is ideal as it prevents issues such as missing DLL errors.
- Who will use your code?
- Consider the potential for reusing your code for different tasks. Languages that share a runtime (e.g., JVM for Java, Kotlin, Scala) enable you to use a package written in one language in another. This can influence your language choice.
- Where will the program run?
- It could be on a Windows client, a Linux-based VM, or even a container. Modern languages typically impose few restrictions on runtime environments, yet this remains an important consideration.
- Preferably a Statically Typed, Compiled Language
- Dynamically typed languages can be beneficial for very small projects, but their sweet spot is often so narrow that they are rarely worth considering. Statically typed languages today are not excessively verbose and offer numerous benefits.
- The key lies in guaranteed type safety. While type annotations or hints from an IDE or plugin can be helpful, statically determined types allow the IDE and plugins to perform far more effectively.
- Even if the project is small now, it may expand rapidly. Remember, once a prototype is created, it often becomes the baseline. Management may not allocate resources to rewrite the prototype in a more robust language. This is a common reality.
- As previously mentioned, if you wish to minimise the risk of encountering runtime errors, it is advisable to opt for statically typed languages. In such languages, many issues that would result in runtime errors in dynamically typed languages can be detected during the compilation process.
- Also, during compilation, code that is never referenced is often removed, making the final product smaller in size.
- Availability of Libraries/SDKs
- If you are uploading files to a public cloud, employing the provider's SDK can simplify your task. While not mandatory, it can make handling API changes easier; simply upgrade to the latest SDK to address breaking changes.
Which language is it then?!
- Node.js / Python:😕
- You may employ Node.js or Python for this task, but they were originally designed to be single-threaded. Although it is possible to achieve multi-threading by utilising the Worker class or the multiprocessing module, this approach adds considerable complexity.
- Furthermore, in the case of Node.js, the absence of synchronisation primitives can quickly corner you.
- Although Python enjoys considerable popularity, it has gained a reputation for frequently introducing breaking changes as a whole. Due to its nature as an interpreted language, these changes often manifest only when the code is executed, resulting in runtime errors.
- Given the reasons outlined, unless you have a particular need for specific libraries or frameworks that align with your goals, there is little justification for choosing these languages if you are beginning the project from scratch.
- Go:🤩
- Go is precisely designed for these types of IO-intensive tasks. It features its own scheduler, enabling each "thread" (termed a goroutine) to operate independently of the OS thread. Owing to this architecture, one can initiate a goroutine for each IO task without requiring what in other languages would be the await keyword. The await keyword, incidentally, instructs the runtime to convert the current process into a state machine, suspend the said process, and release the OS thread on which it was scheduled. Quite a complex operation. However, thanks to Go’s scheduler, the Go runtime manages this seamlessly without the need for explicit declarations in the code.
- It is a statically typed language that can produce a single binary. (I have already mentioned the benefits of a statically typed language, and a single binary above)
- Java (includes Scala, Kotlin)🤩
- C# 😊
- These are general-purpose statically-typed languages with native support for parallelism and concurrency, including multiple threads and synchronisation primitives. They are not specifically designed for IO-intensive tasks, which means your code might appear less elegant with the await keyword scattered throughout. However, this is also what makes them exceptional; if you wish to extend the codebase and create a pipeline that processes the said zip file as input, you can leverage the extensive collection of libraries and frameworks developed for the JVM and .NET ecosystems.
- Although your library in Java can be imported into various popular languages such as Scala and Kotlin, a library written in C# will most likely be utilised solely within C#. This is because neither Visual Basic nor F# is a particularly popular choice among developers.
- While both Java and C# are excellent languages, they do not, by default, compile to a single binary that can run independently of the runtime.
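As a rough illustration of the Go point above, the sketch below starts one goroutine per IO task and waits for all of them, with the task itself written as ordinary blocking code and no await anywhere. fetchAll and slowFetch are hypothetical names, and the sleep merely stands in for network latency.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchAll runs fetch once per name, each call in its own goroutine, and
// returns the results in input order.
func fetchAll(names []string, fetch func(string) string) []string {
	results := make([]string, len(names))
	var wg sync.WaitGroup
	for i, name := range names {
		wg.Add(1)
		go func(i int, name string) {
			defer wg.Done()
			results[i] = fetch(name) // each goroutine writes only to its own slot
		}(i, name)
	}
	wg.Wait()
	return results
}

func main() {
	// slowFetch simulates a blocking network call.
	slowFetch := func(name string) string {
		time.Sleep(100 * time.Millisecond)
		return "content of " + name
	}
	start := time.Now()
	out := fetchAll([]string{"a.zip", "b.zip", "c.zip"}, slowFetch)
	// The three calls overlap, so the elapsed time is close to one sleep, not three.
	fmt.Println(out, time.Since(start).Round(100*time.Millisecond))
}
```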
To conclude, this leads me to Go. It could quite easily have been Java or another JVM language, but I was incidentally learning Go, and I had no plan to expand the code anyway.
Give me the code already
- I'm getting tired too, so let me be frank from here on.
- I ended up with this. It's littered with comments and fmt.Printf()'s here and there. If you don't like them, delete them.
- As stated in the comments, you can create your own implementation for processing the downloaded data, which is passed in the form of io.ReadCloser, and simply swap functions.
- easier to extend!
- Since this is a personal program that is trivial in terms of the size, I decided to simply share the source code rather than turn it into a package.
package main
import (
"archive/zip"
"bytes"
"encoding/json"
"fmt"
"io"
"log"
"net/http"
"os"
"strings"
"sync"
)
const MAX_PARALLEL_FETCH_AND_UNZIP = 1
const MAX_PARALLEL_WRITE = 100
func main() {
fmt.Printf("Starting... \n")
pipeline := toDisk(fetchXML(queryCkan()))
for s := range pipeline {
fmt.Printf("%v", s)
}
}
// tbh can be a simple slice if you don't have to deal with pagination
func queryCkan() chan string {
ch := make(chan string, 10000)
fmt.Println("querying CKAN")
go func() {
defer fmt.Printf("queryCkan channel is closed!\n")
defer close(ch)
fmt.Printf("accessing this: %v\n", CkanEndpoint)
resp, err := http.Get(CkanEndpoint)
if err != nil {
panic(err)
}
b, err := io.ReadAll(resp.Body)
//will take a while for this function to return so close the connection now as it's no longer needed
resp.Body.Close()
if err != nil {
panic(err)
}
fmt.Printf("%v", string(b))
var r CkanQueryResult
if err := json.Unmarshal(b, &r); err != nil {
panic(err)
}
for _, result := range r.Result.Results {
if !strings.HasSuffix(result.Title, "登記所備付地図データ") {
continue
}
for _, resource := range result.Resources {
if !strings.EqualFold(resource.Format, "ZIP") {
continue
}
ch <- resource.URL
fmt.Printf("sending this over the channel %v\n", resource.URL)
}
}
fmt.Printf("done sending!!\n")
}()
return ch
}
// returns io.ReadCloser of each XML in the zipped file over the channel
func fetchXML(urls chan string) chan *F {
fmt.Printf("fetching zips and send XMLs\n")
ch := make(chan *F)
rateLimiter := make(chan struct{}, MAX_PARALLEL_FETCH_AND_UNZIP)
for i := 0; i < MAX_PARALLEL_FETCH_AND_UNZIP; i++ {
rateLimiter <- struct{}{}
}
fmt.Printf("filled the ratelimiter for fetchXML\n")
go func() {
fmt.Printf("inside the closure in fetchXML")
var wg sync.WaitGroup
defer fmt.Printf("this containing goroutine must exit after all the XMLs have been sent over the channel. Done sending!\n")
defer close(ch)
defer wg.Wait()
for url := range urls {
wg.Add(1)
<-rateLimiter
go func(url string) {
defer wg.Done()
fmt.Printf("fetching this over the internet %v\n", url)
resp, err := http.Get(url)
if err != nil {
log.Fatalf("something went wrong when making an http request to the server%v\n", err)
}
var buf bytes.Buffer
_, err = io.Copy(&buf, resp.Body)
//since the response body has been copied to the memory, close the http connection now
//(will take a while until this function returns (= have finished unzipping all the XMLs ))
resp.Body.Close()
if err != nil {
log.Fatalf("something went wrong when loading the response stream onto memory %v\n", err)
}
fmt.Printf("making another goroutine for recursive unzipping!\n")
wg.Add(1)
go func() {
fmt.Printf("within the unzipping section. this section is not done in parallel as it will overcomplicate things\n")
defer wg.Done()
//since the zip file is nested, use recursion to unzip everything contained in the file.
//I want the unzipping to happen sequentially
//(doing in parallel would add more complexity while you wouldn't get any benefit since they're already in the memory)
//so the unzipping part is in a separate function without any children goroutines
extractFileToChan(bytes.NewReader(buf.Bytes()), int64(buf.Len()), ch)
}()
rateLimiter <- struct{}{}
}(url)
}
}()
return ch
}
// I needed a name for recursion (you can't do recursion on anonymous functions)
func extractFileToChan(r io.ReaderAt, s int64, ch chan *F) {
zipReader, err := zip.NewReader(r, s)
if err != nil {
log.Fatalf("could not open the zip file requested %v", err)
}
for _, file := range zipReader.File {
if strings.HasSuffix(file.Name, ".zip") {
fileReader, err := file.Open()
if err != nil {
log.Fatalf("could not open the nested zip file %v", err)
}
var buf bytes.Buffer
_, err = io.Copy(&buf, fileReader)
fileReader.Close()
if err != nil {
log.Fatalf("could not load the zip file onto memory %v", err)
}
extractFileToChan(bytes.NewReader(buf.Bytes()), int64(buf.Len()), ch)
continue
}
if !strings.HasSuffix(file.Name, ".xml") {
continue
}
//finally grabbed the XML file!
fileReader, err := file.Open()
if err != nil {
log.Fatalf("something went wrong when opening the XML file within the zip file %v", err)
}
ch <- &F{Content: fileReader, Name: file.Name}
}
}
// example implementation of io.ReadCloser being saved to the disk.
// you can create your own implementation such as toS3 using https://github.com/aws/aws-sdk-go-v2/blob/service/s3/v1.66.2/service/s3/api_op_PutObject.go#L117
// *io.Reader is included in io.ReadCloser so you can directly use F.Content for upload
func toDisk(fs chan *F) chan struct{} {
ch := make(chan struct{})
rateLimiter := make(chan struct{}, MAX_PARALLEL_WRITE)
for i := 0; i < MAX_PARALLEL_WRITE; i++ {
rateLimiter <- struct{}{}
}
go func() {
var wg sync.WaitGroup
//wait until all the items in the queue has been processed and then close the channel.
defer fmt.Printf("toDisk channel is closed!\n")
defer close(ch)
//this goroutine must exit after all the other children goroutines have exited otherwise the channel will be closed prematurely
defer wg.Wait()
for f := range fs {
wg.Add(1)
<-rateLimiter
go func(f *F) {
defer wg.Done()
fileReader := f.Content
path := f.Name
newFile, err := os.Create(path)
if err != nil {
log.Fatalf("something went wrong when creating the file onto which the content of XML was going to be written %v", err)
}
_, err = io.Copy(newFile, fileReader)
fileReader.Close()
newFile.Close()
//check the result of the copy before reporting the file as done
if err != nil {
log.Fatalf("could not write the XML file to the disk %v", err)
}
//if you want to add another function you might actually want to send something
ch <- struct{}{}
rateLimiter <- struct{}{}
}(f)
}
}()
return ch
}
const CkanEndpoint string = `https://www.geospatial.jp/ckan/api/3/action/package_search?q=(tags:%E6%B3%95%E5%8B%99%E7%9C%81%20AND%20tags:%E5%9C%B0%E5%9B%B3%E6%83%85%E5%A0%B1)&rows=999999`
type F struct {
Content io.ReadCloser
Name string
}
type CkanQueryResult struct {
Help string `json:"help"`
Success bool `json:"success"`
Result struct {
Count int `json:"count"`
Facets struct {
} `json:"facets"`
Results []struct {
Area string `json:"area"`
Author string `json:"author"`
AuthorEmail string `json:"author_email"`
Charge string `json:"charge"`
CreatorUserID string `json:"creator_user_id"`
Emergency string `json:"emergency"`
Fee string `json:"fee"`
ID string `json:"id"`
Isopen bool `json:"isopen"`
LicenseAgreement string `json:"license_agreement"`
LicenseID string `json:"license_id"`
LicenseTitle string `json:"license_title"`
Maintainer string `json:"maintainer"`
MaintainerEmail string `json:"maintainer_email"`
MetadataCreated string `json:"metadata_created"`
MetadataModified string `json:"metadata_modified"`
Name string `json:"name"`
Notes string `json:"notes"`
NumResources int `json:"num_resources"`
NumTags int `json:"num_tags"`
Organization struct {
ID string `json:"id"`
Name string `json:"name"`
Title string `json:"title"`
Type string `json:"type"`
Description string `json:"description"`
ImageURL string `json:"image_url"`
Created string `json:"created"`
IsOrganization bool `json:"is_organization"`
ApprovalStatus string `json:"approval_status"`
State string `json:"state"`
} `json:"organization"`
OwnerOrg string `json:"owner_org"`
Private bool `json:"private"`
Quality string `json:"quality"`
RegisterdDate string `json:"registerd_date"`
Restriction string `json:"restriction"`
Spatial string `json:"spatial"`
State string `json:"state"`
ThumbnailURL string `json:"thumbnail_url"`
Title string `json:"title"`
Type string `json:"type"`
URL interface{} `json:"url"`
Version interface{} `json:"version"`
Extras []struct {
Key string `json:"key"`
Value string `json:"value"`
} `json:"extras"`
Resources []struct {
CacheLastUpdated interface{} `json:"cache_last_updated"`
CacheURL interface{} `json:"cache_url"`
Created string `json:"created"`
DatastoreActive bool `json:"datastore_active"`
Description string `json:"description"`
Format string `json:"format"`
Hash string `json:"hash"`
ID string `json:"id"`
LastModified interface{} `json:"last_modified"`
MetadataModified string `json:"metadata_modified"`
Mimetype interface{} `json:"mimetype"`
MimetypeInner interface{} `json:"mimetype_inner"`
Name string `json:"name"`
PackageID string `json:"package_id"`
Position int `json:"position"`
ResourceType interface{} `json:"resource_type"`
Size interface{} `json:"size"`
State string `json:"state"`
URL string `json:"url"`
URLType interface{} `json:"url_type"`
} `json:"resources"`
Tags []struct {
DisplayName string `json:"display_name"`
ID string `json:"id"`
Name string `json:"name"`
State string `json:"state"`
VocabularyID interface{} `json:"vocabulary_id"`
} `json:"tags"`
Groups []interface{} `json:"groups"`
RelationshipsAsSubject []interface{} `json:"relationships_as_subject"`
RelationshipsAsObject []interface{} `json:"relationships_as_object"`
} `json:"results"`
Sort string `json:"sort"`
SearchFacets struct {
} `json:"search_facets"`
} `json:"result"`
}
Alright then what's the beef with the example implementation?
Disclaimer
I am not good at Python, because I am rarely compelled to use it strongly enough to put up with the defects above.
Without further ado~
Yes we downloaded those files, that's great! But we can't call it a day yet. Why?
We probably want to convert those files to a more generic format such as GeoJSON so that we can for example overlay them on a map.
A cursory Google search should lead you to this:
mojxml2geojson
Data converter for the National Land Register data (mojxml).
What is the National Land Register data?
The conversion specifications are as follows.
Extracts and outputs only the brush polygon data and attributes necessary to maintain the Address Base Registry from the Map XML data. Reference points, boundary points, and boundary lines are not output.
For public coordinate information data, convert coordinate values to longitude and latitude (JGD2011). Add representative point coordinates as attributes.
Data in arbitrary coordinate information are not converted to coordinate values.
Requirement
GDAL
https://gdal.org/download.html
python 3.*
pip 22.*
I'm hoping I'm not the only person feeling this, but at a glance, this gives me more questions than answers:
Why is this in Python?
I already grumbled enough.
Why is the readme.md in English?
And yes, this is exactly the reason why I wrote this article in English, despite the fact that the target audience is Japanese people. It's not that my English is too poor to read documentation written in English; it just doesn't make sense.
"What is the National Land Register data?" is a link to one of their NOTE posts.
That by itself I do not have any issue with, but I assume the post is in Japanese, so then again, why did you write this in English? Plus the link is broken. What did you want to achieve?
The explanation itself is not good enough
My first language is not English either, so I don't want to comment on the fluency part, but this is too much. It comes across as "they don't know their sh*t".
Translations of some of the technical words are incorrect. Let me give you some examples.
- 筆: brush (should be parcel; just think about it, you can say "a patch of land" or "a tract of land" and of course "a parcel of land" but brush..??? And after three minutes it occurs to you that the Japanese word for it is 筆 and you realise it is probably a direct translation, at which point the documentation being in English is totally meaningless because you need to know both languages anyway!!!!)
- 座標系: coordinate information data (should be coordinate reference system; probably lacking an extremely rudimentary understanding of how surveying works. It's not about their English itself, the writer doesn't understand what they are writing.)
- you can always refer to how other countries/areas do this!!!! We have internet!!!
- https://dirsig.cis.rit.edu/docs/new/coordinates.html
- from this you can tell "arbitrary coordinate system" is the word
- https://www.gov.uk/guidance/uk-geospatial-data-standards-coordinate-reference-systems
- making a localised CRS is not weird!!!! Just name it!!
At this point, you realise this project is self-congratulatory, without the tiniest thought for the end user, and start to fear what calamity might await you.
What is this ugly interface?
No support for multiple files??
I believe it is quite natural to assume not many people want to convert one single file; a zipped file contains multiple XMLs and they could be downloading multiple zipped files to get data of a certain region or a prefecture. Why does this program convert data by each XML file, rather than the entire zip file?
Why is this even in CLI?
As I mentioned, not a lot of people want to convert one single file so they will want to add loops. The most naive implementation should look like this:
import os
def convert(xml_file):
    pass
def list_xml_files(directory):
    files_in_directory = os.listdir(directory)
    # os.listdir returns bare file names, so join them with the directory to get usable paths
    xml_files = [os.path.join(directory, file) for file in files_in_directory if file.endswith(".xml")]
    return xml_files
directory_path = '/path/to/your/directory'
xml_files = list_xml_files(directory_path)
for xml_file in xml_files:
    convert(xml_file)
and almost certainly people in their right minds want to make the conversion parallel: because each input and output is separate, it's easily parallelisable. In standard CPython the global interpreter lock keeps threads from running Python code in parallel, but you can utilise the multiprocessing module to increase the number of Python processes and achieve some level of parallelism (I personally would rather use a more decent language if I needed parallelism but, hey, this project is in Python), and your code should look somewhat like this:
import os
from multiprocessing import Pool
def list_xml_files(directory):
    files_in_directory = os.listdir(directory)
    # os.listdir returns bare file names, so join them with the directory to get usable paths
    xml_files = [os.path.join(directory, file) for file in files_in_directory if file.endswith(".xml")]
    return xml_files
def convert(xml_file):
    pass
if __name__ == '__main__':
    directory_path = '/path/to/your/directory'
    xml_files = list_xml_files(directory_path)
    # HERE↓
    with Pool(10) as pool:
        pool.map(convert, xml_files)
But thanks to the way this program is packaged, you can do neither of these without a good amount of modification (wrapping it in with Pool is the easiest part; you also need to make sure there's no race condition), and you actually don't want to make that modification, because you need to redo it every time the library is updated, and you don't notice the dumb mistakes you make until you run the program because it's a dynamic language!!! Why did you not simply publish it as a package so it can be installed with pip and imported with an import statement??? I fail to comprehend your intention. I feel malice.
Okay it could well be a CLI but why via a file?
Yes, this is a huge compromise, and at this point I'm starting to think I'm not using a library, the library is using me. Well. At least can we do without a physical file? Writing to a disk is expensive, so maybe we can pass the content of the file via stdin and we're cool? No...?
It's actually quite easy to make this modification. You just need to change the arguments that go into open(), but you know a library is poorly abstracted when you need to modify the code directly.
import sys
# ↓Specify the fileno of the stdin
with open(sys.stdin.fileno(), encoding="utf-8") as f:
    print("foobarbazbaz")
    print(f.read())
#!/bin/bash
echo "hello" | python3 main.py
# should get:
# foobarbazbaz
# hello
# you can still read from a file using stdin redirection
# Write the word "hello" to a file fffff
echo "hello" > fffff
# Use stdin redirection to get the input from the file fffff
python3 main.py < fffff
Good luck to people using Windows, because stdin redirection with < is not supported in PowerShell yet, so you will have to use Get-Content to print the whole thing to stdout and then pipe it, or use WSL (Windows Subsystem for Linux) to use bash. (Using WSL means you need to enable virtualisation, I hope the IT is okay with that! *"IT" here means the in-house corporate IT department.)
We wanted to do one single easy thing, and now we're getting into this whole mess.
Wait, why do you have Requirement section..?
Dependency information of this kind usually goes in a file. In the case of JavaScript, it's package.json; in Python it should be requirements.txt. Then you notice requirements.txt is not included in the repository.
(*Actually, when you clone the entire repo, there is a requirements.txt hidden within mojxml2geojson.egg-info, at which point you need to understand what egg-info and the other dependency management stuff contained within the directory are, and I personally have no idea because I don't use Python!!!!)
Then the great runtime error
Subduing all the frustration that has accumulated, I decided to bring myself to at least install the program as instructed so I, quite naturally, did this:
git clone https://github.com/digital-go-jp/mojxml2geojson
cd mojxml2geojson
pip install .
which gave me an avalanche of error logs:
copying gdal-utils\osgeo_utils\samples\__init__.py -> build\lib.win-amd64-cpython-310\osgeo_utils\samples
running egg_info
writing gdal-utils\GDAL.egg-info\PKG-INFO
writing dependency_links to gdal-utils\GDAL.egg-info\dependency_links.txt
writing entry points to gdal-utils\GDAL.egg-info\entry_points.txt
writing requirements to gdal-utils\GDAL.egg-info\requires.txt
writing top-level names to gdal-utils\GDAL.egg-info\top_level.txt
reading manifest file 'gdal-utils\GDAL.egg-info\SOURCES.txt'
writing manifest file 'gdal-utils\GDAL.egg-info\SOURCES.txt'
running build_ext
building 'osgeo._gdal' extension
building 'osgeo._gnm' extension
building 'osgeo._ogr' extension
building 'osgeo._gdal_array' extension
building 'osgeo._osr' extension
building 'osgeo._gdalconst' extension
creating build\temp.win-amd64-cpython-310\Release\extensions
"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Users\$Env:UserName\AppData\Local\Programs\Python\Python310\include -IC:\Users\$Env:UserName\AppData\Local\Programs\Python\Python310\Include -IC:\Users\$Env:UserName\AppData\Local\Temp\pip-build-env-jpcojkdw\overlay\Lib\site-packages\numpy\_core\include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" /EHsc /Tpextensions/ogr_wrap.cpp /Fobuild\temp.win-amd64-cpython-310\Release\extensions/ogr_wrap.obj -DSWIG_PYTHON_SILENT_MEMLEAK
"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Users\$Env:UserName\AppData\Local\Programs\Python\Python310\include -IC:\Users\$Env:UserName\AppData\Local\Programs\Python\Python310\Include -IC:\Users\$Env:UserName\AppData\Local\Temp\pip-build-env-jpcojkdw\overlay\Lib\site-packages\numpy\_core\include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" /EHsc /Tpextensions/gnm_wrap.cpp /Fobuild\temp.win-amd64-cpython-310\Release\extensions/gnm_wrap.obj -DSWIG_PYTHON_SILENT_MEMLEAK
"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Users\$Env:UserName\AppData\Local\Programs\Python\Python310\include -IC:\Users\$Env:UserName\AppData\Local\Programs\Python\Python310\Include -IC:\Users\$Env:UserName\AppData\Local\Temp\pip-build-env-jpcojkdw\overlay\Lib\site-packages\numpy\_core\include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" /EHsc /Tpextensions/gdal_wrap.cpp /Fobuild\temp.win-amd64-cpython-310\Release\extensions/gdal_wrap.obj -DSWIG_PYTHON_SILENT_MEMLEAK
"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Users\$Env:UserName\AppData\Local\Programs\Python\Python310\include -IC:\Users\$Env:UserName\AppData\Local\Programs\Python\Python310\Include -IC:\Users\$Env:UserName\AppData\Local\Temp\pip-build-env-jpcojkdw\overlay\Lib\site-packages\numpy\_core\include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" /EHsc /Tpextensions/gdal_array_wrap.cpp /Fobuild\temp.win-amd64-cpython-310\Release\extensions/gdal_array_wrap.obj -DSWIG_PYTHON_SILENT_MEMLEAK
"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Users\$Env:UserName\AppData\Local\Programs\Python\Python310\include -IC:\Users\$Env:UserName\AppData\Local\Programs\Python\Python310\Include -IC:\Users\$Env:UserName\AppData\Local\Temp\pip-build-env-jpcojkdw\overlay\Lib\site-packages\numpy\_core\include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" /Tcextensions/gdalconst_wrap.c /Fobuild\temp.win-amd64-cpython-310\Release\extensions/gdalconst_wrap.obj -DSWIG_PYTHON_SILENT_MEMLEAK
"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Users\$Env:UserName\AppData\Local\Programs\Python\Python310\include -IC:\Users\$Env:UserName\AppData\Local\Programs\Python\Python310\Include -IC:\Users\$Env:UserName\AppData\Local\Temp\pip-build-env-jpcojkdw\overlay\Lib\site-packages\numpy\_core\include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" /EHsc /Tpextensions/osr_wrap.cpp /Fobuild\temp.win-amd64-cpython-310\Release\extensions/osr_wrap.obj -DSWIG_PYTHON_SILENT_MEMLEAK
ogr_wrap.cpp
gdal_array_wrap.cpp
gdal_wrap.cpp
gdalconst_wrap.c
osr_wrap.cpp
gnm_wrap.cpp
extensions/gdalconst_wrap.c(3237): fatal error C1083: Cannot open include file: 'gdal.h': No such file or directory
extensions/gdal_array_wrap.cpp(3380): fatal error C1083: Cannot open include file: 'gdal.h': No such file or directory
extensions/gnm_wrap.cpp(3377): fatal error C1083: Cannot open include file: 'gdal.h': No such file or directory
extensions/gdal_wrap.cpp(3452): fatal error C1083: Cannot open include file: 'cpl_port.h': No such file or directory
extensions/osr_wrap.cpp(3435): fatal error C1083: Cannot open include file: 'cpl_string.h': No such file or directory
extensions/ogr_wrap.cpp(3406): fatal error C1083: Cannot open include file: 'gdal.h': No such file or directory
error: command 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.41.34120\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
It's not that GDAL isn't installed at all; some .h files are missing from my installation. To resolve that, I would have to go web-surfing for maybe a few hours, and I'm not prepared to do that on this particular occasion.
You see, one of the mottos of the DA is "Don't reinvent the wheel", meaning that if everybody tries to solve the same problem over and over, we as a society cannot move forward. It might be a trivial example, but isn't this exactly that? I am not here to meddle with these dependency issues; I just want to convert those files. Give me a break.
What is a possible way out after all this?
- the "correct" way
- GDAL is a widely used, de facto standard library for processing geometries, and you should be able to use it from languages like Java, C#, and Go, as it has been ported or wrapped for different languages by different companies/people
- One of them is godal by Airbus
- The conversion of coordinates between CRSs is done by GDAL, so the code in this repository should mainly be about mapping values from the XML to the output JSON, meaning it shouldn't be too complicated.
- Yes, I mean you should rewrite it! When the wheel doesn't spin, it's better to replace it with a new one than to make do with it. It's more fuel-efficient in the long run.
- As I mentioned above, Go produces a single executable, meaning it's one single .exe, in the case of Windows, that contains all the dependencies. You can give them the executable and that's it! Go supports cross-compiling, so you don't have to own an AMD64 machine to compile the project for AMD64.
- We were worried whether IT would allow us to enable virtualisation; what was that?!
- the "bodgy" way
- git clone it, add interfaces using Celery so the conversion is done in parallel and you can add tasks from other languages (as you can see in the official docs, the task queue is implemented with things like Redis, so as long as you follow the protocol, the language doesn't matter, and there's already a wrapper for Go!)
- This obviously introduces a few additional components in the infrastructure, so you will be writing your own docker-compose.yml.
- I strongly doubt it is sustainable. When you make a prototype, that becomes the baseline. I can see myself maintaining a Howl's Moving Castle for the sole purpose of making fried eggs in the near future.
Another consideration you could have made
Suppose you want to download those zip files, unzip them until you hit XMLs, read them (all in memory, without writing them to disk), and convert only the records whose CRS is not an arbitrary coordinate system. Records in an arbitrary coordinate system can't be physically located on a map, so they're not worth converting, and since Python is slow, we would rather drop them from the input than from the output...
⇒ Sorry, the CRS is specified per file, so you can easily do that in the code you use to download those zip files.
Lessons
Think of the end user. All the effort you put into a project is meaningless if it's never even used.